Background of the Invention

Recent advances in genetic engineering and bioinformatics have enabled the manipulation and characterization
of large portions of the human genome. While efforts to obtain the full sequence of the human genome are rapidly
progressing, there are many practical uses for genetic information which can be implemented with partial knowledge of
the sequence of the human genome.

As the full sequence of the human genome is assembled, the partial sequence information available can be used
to identify genes responsible for detectable human traits, such as genes associated with human diseases, and to develop
diagnostic tests capable of identifying individuals who express a detectable trait as the result of a specific genotype or
individuals whose genotype places them at risk of developing a detectable trait at a subsequent time. Each of these
applications for partial genomic sequence information is based upon the assembly of genetic and physical maps which
order the known genomic sequences along the human chromosomes.

The present invention relates to human genomic sequences which can be used to construct a high resolution
map of the human genome, methods for constructing such a map, methods of identifying genes associated with
detectable human traits, and diagnostics for identifying individuals who carry a gene which causes them to express a
detectable trait or which places them at risk of expressing a detectable trait in the future.



Summary of the Invention 

A first embodiment of the present invention is a method of obtaining a set of biallelic markers comprising the
steps of obtaining a nucleic acid library comprising a plurality of genomic DNA fragments comprising the full genome or a
portion thereof, determining the order of said plurality of genomic DNA fragments in the genome, determining the
sequence of selected regions of said plurality of genomic DNA fragments, and identifying nucleotides in said plurality of
genomic DNA fragments which vary between individuals, thereby defining a set of biallelic markers.

In one aspect of this first embodiment, the identifying step comprises identifying about 20,000 biallelic
markers. In another aspect of this first embodiment, the identifying step comprises identifying about 40,000 biallelic
markers. In a further aspect of this embodiment, the identifying step comprises identifying about 60,000 biallelic
markers. In still another aspect of this first embodiment, the identifying step comprises identifying about 80,000
biallelic markers. In still another aspect of this first embodiment, the identifying step comprises identifying about
100,000 biallelic markers. In still another aspect of this first embodiment, the identifying step comprises identifying
about 120,000 biallelic markers.

In still another aspect of this first embodiment, the biallelic markers are separated from one another by an
average distance of 10 kb-200 kb. In still another aspect of this first embodiment, the biallelic markers are separated
from one another by an average distance of 15 kb-150 kb. In still another aspect of this first embodiment, the biallelic
markers are separated from one another by an average distance of 20 kb-100 kb. In still another aspect of this first
embodiment, the biallelic markers are separated from one another by an average distance of 100 kb-150 kb. In still
another aspect of this first embodiment, the biallelic markers are separated from one another by an average distance of
50-100 kb. In still another aspect of this first embodiment, the biallelic markers are separated from one another by an
average distance of 25 kb-50 kb.

In still another aspect of this first embodiment, the step of determining the sequence of selected regions of
said plurality of genomic DNA fragments comprises inserting fragments of said plurality of genomic DNA fragments into
a vector to generate a plurality of subclones and determining the sequence of a region of the inserts in said plurality of
subclones or a subset thereof. For example, in this aspect of the first embodiment, the step of determining the sequence
of a region of said inserts or a subset thereof may comprise determining the sequence of one or both end regions of said
inserts or a subset thereof. In this aspect of the first embodiment, the step of determining the sequence of one or both
end regions of said plurality of subclones comprises determining the sequence of about 500 bases at each end of said
subclones or a subset thereof.
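
As an illustration of the end-sequencing step described above, the following minimal Python sketch extracts the two
end regions of a subclone insert. The function name `end_regions` and the `read_length` parameter are hypothetical
and do not appear in the disclosure; the default of 500 simply mirrors the "about 500 bases at each end" figure.

```python
from typing import Tuple


def end_regions(insert_seq: str, read_length: int = 500) -> Tuple[str, str]:
    """Return the (5' end, 3' end) regions of a subclone insert.

    read_length is an illustrative parameter; 500 reflects the
    "about 500 bases at each end" figure quoted in the text.
    If the insert is shorter than read_length, both regions are
    the whole insert.
    """
    if read_length <= 0:
        raise ValueError("read_length must be positive")
    five_prime = insert_seq[:read_length]
    three_prime = insert_seq[-read_length:]
    return five_prime, three_prime


# Example with a synthetic 1,200-base insert.
left, right = end_regions("A" * 600 + "C" * 600)
print(len(left), len(right))
```

In practice the end reads would come from a sequencer rather than from slicing a known sequence; the sketch only
shows which parts of each insert the aspect refers to.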

In still another aspect of this first embodiment, a set of about 10,000 to about 20,000 genomic DNA inserts
with an average size between 100 kb and 300 kb are ordered. In still another aspect of this first embodiment, a set of
about 10,000 to about 30,000 genomic DNA inserts with an average size between 100 kb and 150 kb are ordered. In
still another aspect of this first embodiment, a set of about 15,000 to about 25,000 genomic DNA inserts with an
average size between 100 kb and 200 kb are ordered.

In still another aspect of this first embodiment, the identifying step comprises identifying between 1 and 6
biallelic markers per genomic DNA fragment. In still another aspect of this first embodiment, the identifying step
comprises identifying an average of 3 biallelic markers per genomic DNA insert.

In still another aspect of this first embodiment, the genomic DNA fragments are in a Bacterial Artificial 
Chromosome. In still another aspect of this first embodiment, the genomic DNA fragments are in a Yeast Artificial 
Chromosome.

In still another aspect of this first embodiment, the method further comprises determining the position of said 
biallelic markers along the genome or a portion thereof. In this aspect of the first embodiment, the step of determining
the position of said biallelic markers along the genome or portion thereof may comprise determining the position of said 
biallelic markers along a chromosome. In this aspect of the first embodiment, the step of determining the position of 
said biallelic markers along the genome or portion thereof comprises determining the position of said biallelic markers 
along a subchromosomal region. 

In still another aspect of this first embodiment, the method further comprises identifying biallelic markers
which are in linkage disequilibrium with one another. In this aspect of the first embodiment, the method may further
comprise optimizing the intermarker spacing between said biallelic markers such that each identified marker is in linkage
disequilibrium with at least one other identified marker.
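
The text does not state which measure of linkage disequilibrium is intended. As a hedged illustration only, the
sketch below computes three commonly used pairwise measures (D, |D'| and r^2) for two biallelic markers from phased
haplotype counts. The function name, argument names and the choice of measures are assumptions made for this example.

```python
from typing import Tuple


def linkage_disequilibrium(n_AB: int, n_Ab: int, n_aB: int, n_ab: int) -> Tuple[float, float, float]:
    """Return (D, |D'|, r^2) for two biallelic markers.

    Arguments are counts of the four phased haplotypes formed by
    alleles A/a at the first marker and B/b at the second.
    """
    total = n_AB + n_Ab + n_aB + n_ab
    if total == 0:
        raise ValueError("no haplotypes counted")
    p_AB = n_AB / total
    p_A = (n_AB + n_Ab) / total
    p_B = (n_AB + n_aB) / total
    if p_A in (0.0, 1.0) or p_B in (0.0, 1.0):
        raise ValueError("both markers must be polymorphic in the sample")
    # D is the deviation of the observed haplotype frequency from independence.
    d = p_AB - p_A * p_B
    # D' rescales D by its maximum attainable magnitude given the allele frequencies.
    if d >= 0:
        d_max = min(p_A * (1 - p_B), (1 - p_A) * p_B)
    else:
        d_max = min(p_A * p_B, (1 - p_A) * (1 - p_B))
    d_prime = abs(d) / d_max
    r2 = d * d / (p_A * (1 - p_A) * p_B * (1 - p_B))
    return d, d_prime, r2


print(linkage_disequilibrium(40, 10, 10, 40))
```

A marker pair could then be treated as "in linkage disequilibrium" when one of these values exceeds a chosen
threshold; the threshold itself is not specified in the text.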

In still another aspect of this first embodiment, the portion of the genome comprises at least 200 kb of
contiguous genomic DNA. In still another aspect of this first embodiment, the portion of the genome comprises at least
300 kb of contiguous genomic DNA. In still another aspect of this first embodiment, the portion of the genome
comprises at least 500 kb of contiguous genomic DNA. In still another aspect of this first embodiment, the portion of the
genome comprises at least 2 Mb of contiguous genomic DNA. In still another aspect of this first embodiment, the portion
of the genome comprises at least 5 Mb of contiguous genomic DNA. In still another aspect of this first embodiment, the
portion of the genome comprises at least 10 Mb of contiguous genomic DNA. In still another aspect of this first
embodiment, the portion of the genome comprises at least 20 Mb of contiguous genomic DNA.

In still another aspect of this first embodiment, the method further comprises the step of identifying one or
more groups of biallelic markers which are in proximity to one another in the genome. In this aspect of the first
embodiment, the biallelic markers in each of these groups may be located within a genomic region spanning less than
1kb. Alternatively, in this aspect of the first embodiment, the biallelic markers in each of these groups may be located
within a genomic region spanning from 1 to 5kb. Alternatively, in this aspect of the first embodiment, the biallelic markers
in each of these groups may be located within a genomic region spanning from 5 to 10kb. Alternatively, in this aspect of
the first embodiment, the biallelic markers in each of these groups may be located within a genomic region spanning from
10 to 25kb. Alternatively, in this aspect of the first embodiment, the biallelic markers in each of these groups may be
located within a genomic region spanning from 25 to 50kb. Alternatively, in this aspect of the first embodiment, the
biallelic markers in each of these groups may be located within a genomic region spanning from 50 to 150kb.
Alternatively, in this aspect of the first embodiment, the biallelic markers in each of these groups may be located within a
genomic region spanning from 150 to 250kb. Alternatively, in this aspect of the first embodiment, the biallelic markers in
each of these groups may be located within a genomic region spanning from 250 to 500kb. Alternatively, in this aspect of
the first embodiment, the biallelic markers in each of these groups may be located within a genomic region spanning from
500kb to 1Mb. Alternatively, in this aspect of the first embodiment, the biallelic markers in each of these groups may be
located within a genomic region spanning more than 1Mb.
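
The text does not say how groups of neighboring markers are formed. One simple possibility, shown here purely as an
assumption, is a greedy pass over marker positions on a single chromosome that starts a new group whenever the next
marker would stretch the current group to or beyond a chosen span. The function name `group_markers` and its
parameters are hypothetical.

```python
from typing import List


def group_markers(positions_bp: List[int], max_span_bp: int) -> List[List[int]]:
    """Greedily group marker positions (in bp, one chromosome) so that
    each group spans less than max_span_bp from its first to last marker."""
    if max_span_bp <= 0:
        raise ValueError("max_span_bp must be positive")
    groups: List[List[int]] = []
    for pos in sorted(positions_bp):
        # Extend the current group only if the span from its first marker stays below the limit.
        if groups and pos - groups[-1][0] < max_span_bp:
            groups[-1].append(pos)
        else:
            groups.append([pos])
    return groups


# Markers grouped into regions spanning less than 1 kb.
print(group_markers([100, 400, 900, 5000, 5600, 20000], 1000))
```

Groups containing a single marker could be discarded afterwards if only multi-marker groups are of interest.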

A second embodiment of the present invention is a method of obtaining a set of biallelic markers comprising the
steps of obtaining a nucleic acid library comprising genomic DNA fragments comprising the full genome or a portion
thereof, determining the sequence of selected regions of said genomic DNA fragments, identifying nucleotides in said
genomic DNA fragments which vary between individuals, thereby defining a set of biallelic markers, and
determining the order of said biallelic markers along the genome or portion thereof.

A third embodiment of the present invention is a set of biallelic markers obtained by the method of the first
embodiment. In one aspect of this third embodiment, the markers in said set have a known genomic position. In another
aspect of this third embodiment, the markers in said set have a known genomic relationship to one another.

A fourth embodiment of the present invention is a set of biallelic markers having a known relationship to one
another and a known genomic position, said set of biallelic markers being obtained by the method of the first
embodiment. In one aspect of this fourth embodiment, the biallelic markers have heterozygosity rates of at least about
0.18. In another aspect of this fourth embodiment, the biallelic markers have a heterozygosity rate of at least about 0.32.
In still another aspect of this fourth embodiment, the biallelic markers have a heterozygosity rate of at least about 0.42.
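
To give a sense of what these heterozygosity thresholds mean, the sketch below relates heterozygosity to allele
frequency for a biallelic marker. It assumes Hardy-Weinberg proportions, under which the expected heterozygosity is
2p(1 - p) for an allele of frequency p; that assumption and the function names are introduced for this example.

```python
import math


def expected_heterozygosity(allele_freq: float) -> float:
    """Expected heterozygosity 2p(1 - p) of a biallelic marker,
    assuming Hardy-Weinberg proportions."""
    if not 0.0 <= allele_freq <= 1.0:
        raise ValueError("allele frequency must lie in [0, 1]")
    return 2.0 * allele_freq * (1.0 - allele_freq)


def min_minor_allele_freq(heterozygosity: float) -> float:
    """Smallest minor-allele frequency giving at least the stated
    heterozygosity (inverse of 2p(1 - p) on 0 <= p <= 0.5)."""
    if not 0.0 <= heterozygosity <= 0.5:
        raise ValueError("heterozygosity of a biallelic marker lies in [0, 0.5]")
    return (1.0 - math.sqrt(1.0 - 2.0 * heterozygosity)) / 2.0


for h in (0.18, 0.32, 0.42):
    print(h, round(min_minor_allele_freq(h), 3))
```

Under this assumption, heterozygosity rates of 0.18, 0.32 and 0.42 correspond to minor-allele frequencies of about
0.1, 0.2 and 0.3 respectively.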

A fifth embodiment of the present invention is a map comprising an ordered array of at least 20,000 biallelic
markers obtained by the method of the first embodiment. In one aspect of this fifth embodiment, the map comprises an
ordered array of at least 60,000 biallelic markers obtained by the method of the first embodiment. In another aspect of
this fifth embodiment, the map comprises an ordered array of at least 120,000 biallelic markers obtained by the method
of the first embodiment.

In another aspect of this fifth embodiment, the biallelic markers are distributed at an average marker density of
one marker every 150 kb. In a further aspect of this fifth embodiment, the biallelic markers are distributed at an average
marker density of one marker every 50 kb. In a further aspect of this fifth embodiment, the biallelic markers are
distributed at an average marker density of one marker every 25 kb.
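
The relationship between average marker spacing and the size of a genome-wide map is simple arithmetic. The sketch
below assumes a human genome of roughly 3,000 Mb; that figure, the constant name and the function name are
assumptions made for illustration and are not stated in this section.

```python
# Assumed approximate size of the human genome in base pairs (illustrative only).
ASSUMED_GENOME_SIZE_BP = 3_000_000_000


def markers_for_spacing(spacing_bp: int, genome_size_bp: int = ASSUMED_GENOME_SIZE_BP) -> int:
    """Approximate number of evenly spaced markers needed to cover the
    genome at one marker every spacing_bp base pairs."""
    if spacing_bp <= 0:
        raise ValueError("spacing_bp must be positive")
    return genome_size_bp // spacing_bp


for spacing_kb in (150, 50, 25):
    print(f"one marker every {spacing_kb} kb -> about {markers_for_spacing(spacing_kb * 1000):,} markers")
```

With the assumed genome size, spacings of 150 kb, 50 kb and 25 kb work out to about 20,000, 60,000 and 120,000
markers respectively.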

A sixth embodiment of the present invention is a method of identifying one or more biallelic markers associated
with a detectable trait comprising the steps of determining the frequencies of each allele of one or more biallelic
markers obtained by the method of the first embodiment in individuals who express said detectable trait and individuals
who do not express said detectable trait, and identifying one or more alleles of said one or more biallelic markers which
are statistically associated with the expression of said detectable trait. In one aspect of this sixth embodiment, the
detectable trait is selected from the group consisting of disease, drug response, drug efficacy, and drug toxicity. In
another aspect of this sixth embodiment, the phenotype of said individuals who express said detectable trait and the
phenotype of said individuals who do not express said detectable trait are readily distinguishable from one another. In
still another aspect of this sixth embodiment, the individuals who express said detectable trait and the individuals who do
not express said detectable trait are selected from a bimodal phenotype distribution. In still another aspect of this sixth
embodiment, the individuals who express said detectable trait are at one phenotypic extreme of the population and said
individuals who do not express said detectable trait are at the other phenotypic extreme of the population.
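
The text does not name a particular statistical test. One common choice for comparing allele frequencies between
two groups is a Pearson chi-square test on the 2x2 table of allele counts; the sketch below uses that test, without
continuity correction, as an assumption. The function name and the example counts are hypothetical.

```python
import math
from typing import Tuple


def allele_association(trait_pos: Tuple[int, int], trait_neg: Tuple[int, int]) -> Tuple[float, float]:
    """Pearson chi-square test (1 degree of freedom, no continuity
    correction) on allele counts for one biallelic marker.

    Each argument is (count of allele 1, count of allele 2) in the
    trait-positive or trait-negative group. Returns (chi2, p_value).
    """
    a, b = trait_pos
    c, d = trait_neg
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    if min(row1, row2, col1, col2) == 0:
        raise ValueError("each group and each allele must be observed at least once")
    n = row1 + row2
    chi2 = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # For 1 degree of freedom, P(X > chi2) = erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


# Made-up counts: allele 1 seen 60 times in 100 trait-positive chromosomes, 40 times in 100 trait-negative ones.
chi2, p = allele_association((60, 40), (40, 60))
print(chi2, p)
```

When many markers are tested, the resulting p-values would normally need a multiple-testing correction, which is
outside the scope of this sketch.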

A seventh embodiment of the present invention is a method of identifying a haplotype associated with a trait
comprising the steps of obtaining nucleic acid samples from trait positive and trait negative individuals, determining
the frequencies of the alleles of each member of a group of biallelic markers obtained by the method of the first
embodiment which are known to be located in proximity to one another in the genome in said nucleic acid samples, and
identifying a plurality of alleles of biallelic markers having a statistically significant association with said trait. In one
aspect of this seventh embodiment, the detectable trait is selected from the group consisting of disease, drug response,
drug efficacy, and drug toxicity.
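
As a rough sketch of the haplotype comparison described above, the example below counts haplotypes in trait-positive
and trait-negative samples and tests each haplotype against all others with a 1-degree-of-freedom chi-square test.
It assumes the haplotypes have already been phased and are given as strings of alleles (for example "AG" for two
neighboring markers); the phasing step, the test and all names are assumptions, not details taken from the text.

```python
import math
from collections import Counter
from typing import List, Tuple


def haplotype_association(pos_haps: List[str], neg_haps: List[str]) -> List[Tuple[str, float, float, float]]:
    """For each haplotype, return (haplotype, frequency in trait-positive,
    frequency in trait-negative, p_value), sorted by ascending p-value."""
    n_pos, n_neg = len(pos_haps), len(neg_haps)
    if n_pos == 0 or n_neg == 0:
        raise ValueError("both groups must contain at least one haplotype")
    pos_counts, neg_counts = Counter(pos_haps), Counter(neg_haps)
    results = []
    for hap in sorted(set(pos_counts) | set(neg_counts)):
        a, c = pos_counts[hap], neg_counts[hap]
        b, d = n_pos - a, n_neg - c
        col1, col2 = a + c, b + d
        if col2 == 0:
            chi2 = 0.0  # only one haplotype observed, nothing to compare
        else:
            chi2 = (n_pos + n_neg) * (a * d - b * c) ** 2 / (n_pos * n_neg * col1 * col2)
        p_value = math.erfc(math.sqrt(chi2 / 2.0))
        results.append((hap, a / n_pos, c / n_neg, p_value))
    results.sort(key=lambda r: r[3])
    return results


# Made-up phased haplotypes over two neighboring markers.
cases = ["AG"] * 60 + ["CT"] * 40
controls = ["AG"] * 40 + ["CT"] * 60
for row in haplotype_association(cases, controls):
    print(row)
```

Real genotype data are usually unphased, so haplotype frequencies would first have to be estimated; this sketch
sidesteps that step entirely.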

In another aspect of this seventh embodiment, the biallelic markers in each of these groups are located within
a genomic region spanning less than 1kb. In still another aspect of this seventh embodiment, the biallelic markers in each
of these groups are located within a genomic region spanning from 1 to 5kb. In still another aspect of this seventh
embodiment, the biallelic markers in each of these groups are located within a genomic region spanning from 5 to 10kb.
In still another aspect of this seventh embodiment, the biallelic markers in each of these groups are located within a
genomic region spanning from 10 to 25kb. In still another aspect of this seventh embodiment, the biallelic markers in
each of these groups are located within a genomic region spanning from 25 to 50kb. In still another aspect of this seventh
embodiment, the biallelic markers in each of these groups are located within a genomic region spanning from 50 to
150kb. In still another aspect of this seventh embodiment, the biallelic markers in each of these groups are located
within a genomic region spanning from 150 to 250kb. In still another aspect of this seventh embodiment, the biallelic
markers in each of these groups are located within a genomic region spanning from 250 to 500kb. In still another aspect
of this seventh embodiment, the biallelic markers in each of these groups are located within a genomic region spanning
from 500kb to 1Mb. In still another aspect of this seventh embodiment, the biallelic markers in each of these groups are
located within a genomic region spanning more than 1Mb.

An eighth embodiment of the present invention is a method of identifying one or more biallelic markers
associated with a detectable trait comprising the steps of selecting a gene in which mutations result in a detectable trait
or a gene suspected of being associated with a detectable trait and identifying one or more biallelic markers obtained by
the method of Claim 1 within the genomic region harboring said gene which are associated with said detectable trait. In
one aspect of this eighth embodiment, the detectable trait is selected from the group consisting of disease, drug
response, drug efficacy, and drug toxicity. In another aspect of this eighth embodiment, the identifying step comprises
determining the frequencies of said one or more biallelic markers in individuals who express said detectable
trait and individuals who do not express said detectable trait and identifying one or more biallelic markers which are
statistically associated with the expression of said detectable trait.

A ninth embodiment of the present invention is an array of nucleic acids fixed to a support, said nucleic acids
comprising at least 8 consecutive nucleotides, including the polymorphic nucleotide, of one or more biallelic markers
obtained by the method of the first embodiment. In one aspect of this ninth embodiment, the nucleic acids comprise at
least 15 consecutive nucleotides, including the polymorphic nucleotide, of at least five biallelic markers obtained by the
method of the first embodiment. In another aspect of this ninth embodiment,
the nucleic acids comprise at least 8 consecutive nucleotides, including the polymorphic nucleotide, of at least ten
biallelic markers obtained by the method of the first embodiment.

A tenth embodiment of the present invention is an array of nucleic acids fixed to a support, said nucleic acids 
comprising at least 8 consecutive nucleotides, including the polymorphic nucleotide, of one or more groups of biallelic
markers known to be located in proximity to one another in the genome. 

An eleventh embodiment of the present invention is an array of nucleic acids fixed to a support, said nucleic 
acids comprising amplification primers for generating an amplification product comprising at least 8 consecutive 
nucleotides, including the polymorphic nucleotide, of one or more biallelic markers obtained by the method of the first
embodiment.

A twelfth embodiment of the present invention is an array of nucleic acids fixed to a support, said nucleic acids
comprising amplification primers for generating an amplification product comprising at least 15 consecutive
nucleotides, including the polymorphic nucleotide, of one or more groups of biallelic markers known to be located in
proximity to one another in the genome.

A thirteenth embodiment of the present invention is an array of nucleic acids fixed to a support, said nucleic
acids comprising one or more microsequencing primers for determining the identity of the polymorphic base of one or
more nucleic acids comprising at least 15 consecutive nucleotides, including the polymorphic nucleotide, of one or more
biallelic markers obtained by the method of the first embodiment.

A fourteenth embodiment of the present invention is an array of nucleic acids fixed to a support, said nucleic
acids comprising one or more microsequencing primers for determining the identity of the polymorphic bases of
one or more groups of biallelic markers known to be located in proximity to one another in the genome.

A fifteenth embodiment of the present invention is an array of nucleic acids fixed to a support, wherein said
nucleic acids are complementary to one or more microsequencing primers for determining the identities of the
polymorphic bases of one or more biallelic markers obtained by the method of the first embodiment. In one aspect of
this fifteenth embodiment, the nucleic acids are complementary to at least five microsequencing primers for determining
the identities of the polymorphic bases of at least five biallelic markers obtained by the method of the first embodiment.
In another aspect of this fifteenth embodiment, the nucleic acids are complementary to at least ten microsequencing
primers for determining the identities of the polymorphic bases of at least ten biallelic markers obtained by the method
of the first embodiment.

A sixteenth embodiment of the present invention is an array of nucleic acids fixed to a support, said nucleic
acids comprising one or more nucleic acids complementary to one or more microsequencing primers for determining the
identity of the polymorphic bases of one or more groups of biallelic markers known to be located in proximity to one
another in the genome.

Another aspect of the present invention is an array of any one of the tenth, twelfth, fourteenth or sixteenth
embodiments, wherein the members of each of said one or more groups of biallelic markers are located in physical
proximity to one another on said support.

Another aspect of the present invention is an array of any one of the tenth, twelfth, fourteenth or sixteenth
embodiments, wherein said biallelic markers in each of these groups are located within a genomic region spanning less
than 1kb.

Another aspect of the present invention is an array of any one of the tenth, twelfth, fourteenth or sixteenth
embodiments, wherein said biallelic markers in each of these groups are located within a genomic region spanning from 1
to 5kb.

Another aspect of the present invention is an array of any one of the tenth, twelfth, fourteenth or sixteenth
embodiments, wherein the biallelic markers in each of these groups are located within a genomic region spanning from 5
to 10kb.

Another aspect of the present invention is an array of any one of the tenth, twelfth, fourteenth or sixteenth
embodiments, wherein the biallelic markers in each of these groups are located within a genomic region spanning from
10 to 25kb.

Another aspect of the present invention is an array of any one of the tenth, twelfth, fourteenth or sixteenth
embodiments, wherein the biallelic markers in each of these groups are located within a genomic region spanning from
25 to 50kb.

Another aspect of the present invention is an array of any one of the tenth, twelfth, fourteenth or sixteenth
embodiments, wherein the biallelic markers in each of these groups are located within a genomic region spanning from
50 to 150kb.

Another aspect of the present invention is an array of any one of the tenth, twelfth, fourteenth or sixteenth
embodiments, wherein the biallelic markers in each of these groups are located within a genomic region spanning from
150 to 250kb.

Another aspect of the present invention is an array of any one of the tenth, twelfth, fourteenth or sixteenth
embodiments, wherein the biallelic markers in each of these groups are located within a genomic region spanning from
250 to 500kb.

Another aspect of the present invention is an array of any one of the tenth, twelfth, fourteenth or sixteenth
embodiments, wherein the biallelic markers in each of these groups are located within a genomic region spanning from
500kb to 1Mb.

Another aspect of the present invention is an array of any one of the tenth, twelfth, fourteenth or sixteenth
embodiments, wherein the biallelic markers in each of these groups are located within a genomic region spanning more
than 1Mb.

Another aspect of the present invention is an array of any one of the tenth, twelfth, fourteenth or sixteenth
embodiments, wherein each group of biallelic markers comprises at least 3 biallelic markers.

Another aspect of the present invention is an array of any one of the tenth, twelfth, fourteenth or sixteenth
embodiments, wherein each group of biallelic markers comprises at least 6 biallelic markers.

Another aspect of the present invention is an array of any one of the tenth, twelfth, fourteenth or sixteenth
embodiments, wherein each group of biallelic markers comprises at least 20 biallelic markers.

A seventeenth embodiment of the present invention is a method for determining whether an individual is at risk
of developing a detectable trait or suffers from a detectable trait associated with said trait comprising the steps of
obtaining a nucleic acid sample from said individual, screening said nucleic acid sample with one or more biallelic markers
obtained by the method of the first embodiment, and determining whether said nucleic acid sample contains one or more
biallelic markers statistically associated with said detectable trait. In one aspect of this seventeenth embodiment, the
detectable trait is selected from the group consisting of disease, drug response, drug efficacy and drug toxicity. In
another aspect of this seventeenth embodiment, the biallelic markers were obtained by the method of the sixth
embodiment. In another aspect of this seventeenth embodiment, the biallelic markers were obtained by the method of
the eighth embodiment.

An eighteenth embodiment of the present invention is a method of using a drug comprising obtaining a nucleic
acid sample from an individual, determining the identity of the polymorphic base of one or more biallelic markers obtained
by the method of the first embodiment which is associated with a positive response to treatment with said drug or one
or more biallelic markers obtained by the method of the first embodiment which is associated with a negative response
to treatment with said drug, and administering said drug to said individual if said nucleic acid sample contains one or
more biallelic markers associated with a positive response to treatment with said drug or if said nucleic acid sample
lacks one or more biallelic markers associated with a negative response to said drug. In one aspect of this eighteenth
embodiment, the determining step comprises determining the identity of the polymorphic base of one or more biallelic
markers obtained by the method of the aspect of the sixth embodiment wherein the trait is drug response which is
associated with a positive response to treatment with said drug or one or more biallelic markers obtained by the aspect
of the sixth embodiment wherein the trait is drug response which is associated with a negative response to treatment
with said drug. In another aspect of this eighteenth embodiment, the determining step comprises determining the
identity of the polymorphic base of one or more biallelic markers obtained by the aspect of the eighth embodiment
wherein the trait is drug response which is associated with a positive response to treatment with said drug or one or
more biallelic markers obtained by the method of the aspect of the eighth embodiment wherein the trait is drug response
which is associated with a negative response to treatment with said drug.

A nineteenth embodiment of the present invention is a method of selecting an individual for inclusion in a
clinical trial of a drug comprising obtaining a nucleic acid sample from an individual, determining the identity of the
polymorphic base of one or more biallelic markers obtained by the method of the first embodiment which is associated
with a positive response to treatment with said drug or one or more biallelic markers associated with a negative
response to treatment with said drug in said nucleic acid sample, and including said individual in said clinical trial if said
nucleic acid sample contains one or more biallelic markers obtained by the method of the first embodiment which is
associated with a positive response to treatment with said drug or if said nucleic acid sample lacks one or more biallelic
markers associated with a negative response to said drug. In one aspect of this nineteenth embodiment, the determining
step comprises determining the identity of the polymorphic base of one or more biallelic markers obtained by the aspect
of the sixth embodiment wherein the trait is drug response which is associated with a positive response to treatment
with said drug or one or more biallelic markers obtained by the aspect of the sixth embodiment wherein the trait is drug
response which is associated with a negative response to treatment with said drug. In another aspect of this nineteenth
embodiment, the determining step comprises determining the identity of the polymorphic base of one or more biallelic
markers obtained by the aspect of the eighth embodiment wherein the trait is drug response which is associated with a
positive response to treatment with said drug or one or more biallelic markers obtained by the aspect of the eighth
embodiment wherein the trait is drug response which is associated with a negative response to treatment with said
drug.

A twentieth embodiment of the present invention is a method of identifying a gene associated with a
detectable trait comprising the steps of determining the frequency of each allele of one or more biallelic markers
obtained by the method of the first embodiment in individuals having said detectable trait and individuals lacking said
detectable trait, identifying one or more alleles of one or more biallelic markers having a statistically significant
association with said detectable trait, and identifying a gene in linkage disequilibrium with said one or more alleles.
In one aspect of this twentieth embodiment, the method further comprises identifying a mutation in the gene which is
associated with said detectable trait. In another aspect of this twentieth embodiment, the detectable trait is selected
from the group consisting of disease, drug response, drug efficacy, and drug toxicity.

A twenty-first embodiment of the present invention is a method of identifying a gene associated with a
detectable trait comprising selecting a gene suspected of being associated with a detectable trait and identifying
one or more biallelic markers obtained by the method of the first embodiment within the genomic region harboring said
gene which are associated with said detectable trait. In one aspect of this twenty-first embodiment, the detectable trait
is selected from the group consisting of disease, drug response, drug efficacy, and drug toxicity. In another aspect of
this twenty-first embodiment, the identifying step comprises determining the frequencies of said one or more biallelic
markers in individuals who express said detectable trait and individuals who do not express said detectable trait and
identifying one or more biallelic markers which are statistically associated with the expression of said
detectable trait.

A twenty-second embodiment of the present invention is a method of identifying a haplotype associated with
a trait comprising the steps of obtaining nucleic acid samples from trait positive and trait negative individuals,
conducting an amplification reaction on said nucleic acid samples using amplification primers capable of
generating amplification products containing the polymorphic bases of a plurality of biallelic markers, contacting one or
more arrays according to the tenth embodiment with said amplification products, determining the identities of the
polymorphic bases of said amplification products, and identifying a haplotype having a statistically significant
association with said trait.

A twenty-third embodiment of the present invention is a method of identifying a haplotype associated with a
trait comprising the steps of obtaining nucleic acid samples from trait positive and trait negative individuals, conducting
amplification reactions on said nucleic acid samples using amplification primers capable of generating amplification
products containing the polymorphic bases of a plurality of biallelic markers, contacting one or more arrays according to
the fourteenth embodiment with said amplification products, conducting microsequencing reactions on said
amplification products using microsequencing primers on said arrays, thereby generating elongated microsequencing
primers comprising the polymorphic bases of said amplification products, determining the identities of said polymorphic
bases, and identifying a haplotype having a statistically significant association with said trait.

A twenty-fourth embodiment of the present invention is a method of identifying a haplotype associated with a 
trait comprising the steps of obtaining nucleic acid samples from trait positive and trait negative individuals, conducting 
amplification reactions on said nucleic acid samples using amplification primers which are capable of generating 
amplification products containing the polymorphic bases of a plurality of biallelic markers, conducting microsequencing 
reactions on said nucleic acid samples, thereby generating microsequencing products containing the polymorphic bases 
of one or more biallelic markers at their 3' ends, said polymorphic bases being detectably labeled, contacting one or more 
arrays according to the sixteenth embodiment with said microsequencing products such that said microsequencing 
products specifically hybridize to said nucleic acids complementary to said microsequencing primers, determining 
the identities of the polymorphic bases of said microsequencing products, and identifying a haplotype having a 
statistically significant association with said trait. 

A twenty-fifth embodiment of the present invention is a method of identifying a haplotype associated with a 
trait comprising the steps of obtaining nucleic acid samples from trait positive and trait negative individuals, contacting 
one or more arrays according to the twelfth embodiment with said nucleic acid samples, conducting an amplification 
reaction on said nucleic acid samples using amplification primers on said array which are capable of generating 
amplification products containing the polymorphic bases of a plurality of biallelic markers, determining the identities of 
the polymorphic bases of said amplification products, and identifying a haplotype having a statistically significant 
association with said trait. 

A twenty-sixth embodiment of the present invention is a method of determining whether an individual is at risk 
of developing Alzheimer's disease or whether the individual suffers from Alzheimer's disease as a result of possessing 
the Apo E e4 Site A allele comprising obtaining a nucleic acid sample from said individual, and determining the identity 
of the polymorphic base in one or more of the sequences selected from the group consisting of SEQ ID Nos. 301-305 and 
SEQ ID Nos. 307-311 or the sequences complementary thereto in said nucleic acid sample. In one aspect of this 
twenty-sixth embodiment, the method further comprises determining whether said nucleic acid sample contains the sequence of 
SEQ ID No. 306 or the sequence complementary thereto. In another aspect of this twenty-sixth embodiment, the step of 
determining the identity of the polymorphic bases in one or more of the sequences selected from the group consisting of 
SEQ ID Nos. 301-305 and SEQ ID Nos. 307-311 or the sequences complementary thereto comprises determining 
whether said nucleic acid sample contains the sequence of SEQ ID No. 311 (the T allele of marker 99-365/344) or the 
sequence complementary thereto. In another version of the preceding aspect, the method further comprises determining whether 
said nucleic acid sample contains the sequence of SEQ ID No. 306 or the sequence complementary thereto. 

A twenty-seventh embodiment of the present invention is an isolated nucleic acid comprising a sequence 
selected from the group consisting of SEQ ID No. 301, SEQ ID No. 307, the sequences complementary thereto, and 
fragments comprising at least 8 consecutive nucleotides, including the polymorphic nucleotide, thereof. 

A twenty-eighth embodiment of the present invention is an isolated nucleic acid comprising a sequence 
selected from the group consisting of SEQ ID No. 302, SEQ ID No. 308, the sequences complementary thereto, and 
fragments comprising at least 8 consecutive nucleotides thereof. 

A twenty-ninth embodiment of the present invention is an isolated nucleic acid comprising a sequence selected 
from the group consisting of SEQ ID No. 303, SEQ ID No. 309, the sequences complementary thereto, and fragments 
comprising at least 8 consecutive nucleotides, including the polymorphic nucleotide, thereof. 

A thirtieth embodiment of the present invention is an isolated nucleic acid comprising a sequence selected from 
the group consisting of SEQ ID No. 304, SEQ ID No. 310, the sequences complementary thereto, and fragments 
comprising at least 8 consecutive nucleotides, including the polymorphic nucleotide, thereof. 

A thirty first embodiment of the present invention is an isolated nucleic acid comprising a sequence selected 
from the group consisting of SEQ ID No. 305, SEQ ID No. 311, the sequences complementary thereto, and fragments 
comprising at least 8 consecutive nucleotides, including the polymorphic nucleotide, thereof. 

A thirty second embodiment of the present invention is an isolated nucleic acid comprising a sequence selected 
from the group consisting of SEQ ID Nos. 313-317, SEQ ID Nos. 319-323, and fragments comprising at least 8 
consecutive nucleotides thereof. 

A thirty third embodiment of the present invention is an isolated nucleic acid comprising a sequence selected from 
the group consisting of SEQ ID Nos. 325-329, SEQ ID Nos. 331-335, the sequences complementary thereto, and 
fragments comprising at least 8 consecutive nucleotides thereof. 

A thirty fourth embodiment of the present invention is a set of nucleic acids comprising at least 8 consecutive 
nucleotides, including the polymorphic nucleotide, of one or more biallelic markers obtained by the method of the first 
embodiment. 

A thirty fifth embodiment of the present invention is a set of nucleic acids comprising amplification primers for 
generating an amplification product comprising at least 8 consecutive nucleotides, including the polymorphic nucleotide, 
of one or more biallelic markers obtained by the method of the first embodiment. 

A thirty sixth embodiment of the present invention is a set of nucleic acids comprising one or more 
microsequencing primers for determining the identity of the polymorphic base of one or more nucleic acids comprising at 
least 8 consecutive nucleotides, including the polymorphic nucleotide, of one or more biallelic markers obtained by the 
method of the first embodiment. 

Brief Description of the Drawings 
Figure 1 is a cytogenetic map of chromosome 21. 

Figure 2a shows the results of a computer simulation of the distribution of inter-marker spacing on a randomly 
distributed set of biallelic markers indicating the percentage of biallelic markers which will be spaced a given distance 
apart for 1, 2, or 3 markers/BAC in a genomic map (assuming a set of 20,000 minimally overlapping BACs covering the 
genome are evaluated). 

Figure 2b shows the results of a computer simulation of the distribution of inter-marker spacing on a randomly 
distributed set of biallelic markers indicating the percentage of biallelic markers which will be spaced a given distance 
apart for 1, 3, or 6 markers/BAC in a genomic map (assuming a set of 20,000 minimally overlapping BACs covering the 
genome are evaluated). 

Figure 3 shows, for a series of hypothetical sample sizes, the p-value significance obtained in association 
studies performed using individual markers from the high-density biallelic map, according to various hypotheses regarding 
the difference of allelic frequencies between the T+ and T- samples. 

Figure 4 is a hypothetical association analysis conducted with a map comprising about 3,000 biallelic markers. 

Figure 5 is a hypothetical association analysis conducted with a map comprising about 20,000 biallelic 
markers. 

Figure 6 is a hypothetical association analysis conducted with a map comprising about 60,000 biallelic 
markers. 

Figure 7 is a haplotype analysis using biallelic markers in the Apo E region. 

Figure 8 is a simulated haplotype analysis using the biallelic markers in the Apo E region included in the 
haplotype analysis of Figure 7. 

Figure 9 shows a minimal array of overlapping clones which was chosen for further studies of biallelic markers 
associated with prostate cancer, the positions of STS markers known to map in the candidate genomic region along the 
contig, and the locations of biallelic markers along the BAC contig harboring a genomic region harboring a candidate gene 
associated with prostate cancer which were identified using the methods of the present invention. 

Figure 10 is a rough localization of a candidate gene for prostate cancer which was obtained by determining 
the frequencies of the biallelic markers of Figure 9 in affected and unaffected populations. 

Figure 11 is a further refinement of the localization of the candidate gene for prostate cancer using additional 
biallelic markers which were not included in the rough localization illustrated in Figure 10. 

Figure 12 is a haplotype analysis using the biallelic markers in the genomic region of the gene associated with 
prostate cancer. 

Figure 13 is a simulated haplotype using the six markers included in haplotype 5 of Figure 12. 



Detailed Description of the Preferred Embodiment 
The human haploid genome contains an estimated 80,000 to 100,000 or more genes scattered on a 
3 x 10^9 base-long double stranded DNA shared among the 24 chromosomes. Each human being is diploid, i.e. possesses 
two haploid genomes, one from paternal origin, the other from maternal origin. The sequence of the human genome 
varies among individuals in a population. About 10^7 sites scattered along the 3 x 10^9 base pairs of DNA are polymorphic, 
existing in at least two variant forms called alleles. Most of these polymorphic sites are generated by single base 
substitution mutations and are biallelic. Less than 10^5 polymorphic sites are due to more complex changes and are very 
often multi-allelic, i.e. exist in more than two allelic forms. At a given polymorphic site, any individual (diploid) can be 
either homozygous (twice the same allele) or heterozygous (two different alleles). A given polymorphism or rare mutation 
can be either neutral (no effect on trait), or functional, i.e. responsible for a particular genetic trait. 

Genetic Maps 

The first step towards the identification of genes associated with a detectable trait, such as a disease or any 
other detectable trait, consists in the localization of genomic regions containing trait-causing genes using genetic 
mapping methods. The preferred traits contemplated within the present invention relate to fields of therapeutic interest; 
in particular embodiments, they will be disease traits and/or drug response traits, reflecting drug efficacy or toxicity. 
Traits can either be "binary", e.g. diabetic vs. non-diabetic, or "quantitative", e.g. elevated blood pressure. Individuals 
affected by a quantitative trait can be classified according to an appropriate scale of trait values, e.g. blood pressure 
ranges. Each trait value range can then be analyzed as a binary trait. Patients showing a trait value within one such 
range will be studied in comparison with patients showing a trait value outside of this range. In such a case, genetic 
analysis methods will be applied to subpopulations of individuals showing trait values within defined ranges. 
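
The reduction of a quantitative trait to a binary one can be pictured with the following minimal sketch. The function name `split_by_trait_range`, the half-open range convention, and the blood pressure readings are hypothetical and serve only to illustrate the classification step.

```python
def split_by_trait_range(trait_values, low, high):
    """Return (inside, outside): individuals whose value is in [low, high) and the rest."""
    inside, outside = [], []
    for individual, value in trait_values.items():
        (inside if low <= value < high else outside).append(individual)
    return inside, outside


# Hypothetical systolic blood pressure readings (mm Hg).
readings = {"ind1": 118, "ind2": 142, "ind3": 165, "ind4": 151, "ind5": 129}
in_range, out_of_range = split_by_trait_range(readings, 140, 160)
print(in_range, out_of_range)
```

The two resulting groups can then be treated as the trait positive and trait negative populations for that range.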

Genetic mapping involves the analysis of the segregation of polymorphic loci in trait 
positive and trait negative populations. Polymorphic loci constitute a small fraction of the human 
genome (less than 1%), compared to the vast majority of human genomic DNA which is identical in 
sequence among the chromosomes of different individuals. Among all existing human polymorphic 
loci, genetic markers can be defined as genome-derived polynucleotides which are sufficiently 
polymorphic to allow a reasonable probability that a randomly selected person will be heterozygous, 
and thus informative for genetic analysis by methods such as linkage analysis or association studies. 

A genetic map consists of a collection of polymorphic markers which have been positioned on the human 
chromosomes. Genetic maps may be combined with physical maps, collections of ordered overlapping fragments of 
genomic DNA whose arrangement along the human chromosomes is known. The optimal genetic map should possess 
the following characteristics: 

- the density of the genetic markers scattered along the genome should be sufficient to allow the identification and 
localization of any trait-related polymorphism, 

- each marker should have an adequate level of heterozygosity, so as to be informative in a large percentage of different 
meioses, 

- all markers should be easily typed on a routine basis, at a reasonable expense, and in a reasonable amount of time, 

- the entire set of markers per chromosome should be ordered in a highly reliable fashion. 

However, while the above maps are optimal, it will be appreciated that the maps of the present invention may 
be used in the individual marker and haplotype association analyses described below without the necessity of 
determining the order of biallelic markers derived from a single BAC with respect to one another. 



The analysis of DNA polymorphisms has relied on the following types of polymorphisms. The first generation 
of genetic markers were restriction fragment length polymorphisms (RFLPs), single nucleotide polymorphisms which 
occur at restriction sites, thereby modifying the cleavage pattern of the corresponding restriction enzyme. Though the 
original methods used to type RFLPs were material-, effort- and time-consuming, today these markers can easily be 
typed by PCR-based technologies. Since they are biallelic markers (they present only two alleles, the restriction site 
being either present or absent), their maximum heterozygosity is 0.5. The theoretical number of RFLPs distributed along 
the entire human genome is more than 10^5, which leads to a potential average inter-marker distance of 30 kilobases. 
However, in reality the number of evenly distributed RFLPs which occur at a sufficient frequency in the population to 
make them useful for tracking of genetic polymorphisms is very limited. 

The second generation of genetic markers was VNTRs (Variable Number of Tandem Repeats), which can be 
categorized as either minisatellites or microsatellites. Minisatellites are tandemly repeated DNA sequences present in 
units of 5-50 repeats which are distributed along regions of the human chromosomes ranging from 0.1 to 20 kilobases in 
length. Since they present many possible alleles, their polymorphic informative content is very high. Minisatellites are 
scored by performing Southern blots to identify the number of tandem repeats present in a nucleic acid sample from the 
individual being tested. However, there are only 10^4 potential VNTRs that can be typed by Southern blotting. 

Microsatellites (also called simple tandem repeat polymorphisms, or simple sequence length polymorphisms) 
constitute the most developed category of genetic markers. They include small arrays of tandem repeats of simple 
sequences (di-, tri-, tetra-nucleotide repeats) which exhibit a high degree of length polymorphism and thus a high level of 
informativeness. Slightly more than 5,000 microsatellites, easily typed by PCR-derived technologies, have been ordered 
along the human genome (Dib et al., Nature 380:152 (1996), the disclosure of which is incorporated herein by 
reference). 



Genetic Maps Based on RFLPs or VNTRs 




A number of these available microsatellites were used to construct integrated physical and genetic maps 
containing less than 5,000 markers. For example, CEPH (Chumakov et al., Nature 377:175-298 (1995) and Cohen et al., 
Nature 366:698-701 (1993), the disclosures of which are incorporated herein by reference), and Whitehead Institute 
and Généthon (Hudson et al., 1995), constructed genetic and physical maps covering 75% to 95% of the human genome, 
based on 2500 to 5000 microsatellite markers. 

However, the number of easily typed informative markers in these maps was too small for the average 
distance between informative markers to fulfill the above-listed requirements for genetic maps. 

Biallelic Markers 

Biallelic markers are genome-derived polynucleotides which exhibit biallelic polymorphism. As used herein, the 
term biallelic marker means a biallelic single nucleotide polymorphism. As used herein, the term polymorphism may 
include a single base substitution, insertion, or deletion. By definition, the lowest allele frequency of a biallelic 
polymorphism is 1% (sequence variants which show allele frequencies below 1% are called rare mutations). There are 
potentially more than 10^7 biallelic markers which can easily be typed by routine automated techniques, such as 
sequence- or hybridization-based techniques, out of which 10^6 are sufficiently informative for mapping purposes. 
However, a biallelic marker will show a sufficient degree of informativeness for use in genetic mapping only if the 
frequency of its less frequent allele is not less than about 10% (i.e. a heterozygosity rate of at least 0.18) (the 
heterozygosity rate for a biallelic marker is 2 P_a (1 - P_a), where P_a is the frequency of allele a). Preferably, the frequency 
of the less frequent allele of the biallelic markers in the present maps is at least 20% (i.e. a heterozygosity rate of at 
least 0.32). More preferably, the frequency of the less frequent allele of the biallelic markers in the present maps is at 
least 30% (i.e. its heterozygosity rate is higher than about 0.42). 
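
The heterozygosity thresholds quoted above follow directly from the expression 2 P_a (1 - P_a). A minimal sketch of that arithmetic is given below; the function name `heterozygosity` is chosen for illustration only.

```python
def heterozygosity(p_a):
    """Expected heterozygosity rate of a biallelic marker with allele frequency p_a."""
    if not 0.0 <= p_a <= 1.0:
        raise ValueError("allele frequency must lie between 0 and 1")
    return 2.0 * p_a * (1.0 - p_a)


# Frequencies of the less frequent allele discussed in the text, plus the
# 50% case at which a biallelic marker reaches its maximum heterozygosity.
for freq in (0.10, 0.20, 0.30, 0.50):
    print(f"{freq:.2f} -> {heterozygosity(freq):.2f}")
```

The values obtained for 10%, 20% and 30% are 0.18, 0.32 and 0.42, and the maximum of 0.5 is reached when both alleles are equally frequent.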

Initial attempts to construct genetic maps based on non-RFLP biallelic markers have focused on identifying 
biallelic markers lying within sequence tagged sites (STS), pieces of genomic DNA having a known sequence and 
averaging about 250 bases in length. More than 30,000 STSs have been identified and ordered along the genome 
(Hudson et al., Science 270:1945-1954 (1995); Schuler et al., Science 274:540-545 (1996), the disclosures of which 
are incorporated herein by reference). For example, the Whitehead Institute and Généthon's integrated map contains 
15,086 STSs. 

These sequence tagged sites can be screened to identify polymorphisms, preferably Single Nucleotide 
Polymorphisms (SNPs), more preferably non-RFLP biallelic markers therein. Generally polymorphisms are identified by 
determining the sequence of the STSs in 5 to 10 individuals. 

Wang et al. (Cold Spring Harbor Laboratory: Abstracts of papers presented on Genome Mapping and 
Sequencing (May 14-18, 1997), the disclosure of which is incorporated herein by reference) recently announced the 
identification and mapping of 750 Single Nucleotide Polymorphisms issued from the sequencing of 12,000 STSs from 
the Whitehead/MIT map, in eight unrelated individuals. The map was assembled using a high throughput system based 
on the utilization of DNA chip technology available from Affymetrix (Chee et al., Science 274:610-614 (1996), the 
disclosure of which is incorporated herein by reference). 

However, according to experimental data and statistical calculations, less than one out of 10 of all STSs 
mapped today will contain an informative Single Nucleotide Polymorphism. This is primarily due to the short length of 
existing STSs (usually less than 250 bp). If one assumes 10^6 informative SNPs spread along the human genome, there 
would on average be one marker of interest every 3 x 10^9 / 10^6, i.e. every 3,000 bp. The probability that one such marker 
is present on a 250 bp stretch is thus less than 1/10. 
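
The estimate above can be checked with a short calculation. The sketch below assumes a genome of 3 x 10^9 bp and markers spread uniformly along it; the function names are illustrative. The expected number of markers in a stretch is used as an upper bound on the probability that the stretch contains at least one marker.

```python
GENOME_BP = 3 * 10**9  # genome size used in the text


def mean_spacing_bp(n_markers, genome_bp=GENOME_BP):
    """Average distance between markers spread evenly over the genome."""
    return genome_bp / n_markers


def expected_markers(stretch_bp, n_markers, genome_bp=GENOME_BP):
    """Expected number of markers falling in a stretch of the given length.

    This value is also an upper bound on the probability that the stretch
    contains at least one marker.
    """
    return stretch_bp * n_markers / genome_bp


print(mean_spacing_bp(10**6))        # average spacing in bp
print(expected_markers(250, 10**6))  # expected markers per 250 bp STS
```

With one marker every 3,000 bp, a 250 bp STS is expected to carry about 0.083 informative markers, which is consistent with the figure of less than one STS in ten.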

While it could produce a high density map, the STS approach based on currently existing markers does not put 
any systematic effort into making sure that the markers obtained are optimally distributed throughout the entire 
genome. Instead, polymorphisms are limited to those locations for which STSs are available. 

The even distribution of markers along the chromosomes is critical to the future success of genetic analyses. 
In particular, a high density map having appropriately spaced markers is essential for conducting association studies on 
sporadic cases, aiming at identifying genes responsible for detectable traits such as those which are described below. 

As will be further explained below, genetic studies have mostly relied in the past on a statistical approach 
called linkage analysis, which took advantage of microsatellite markers to study their inheritance pattern within families 
from which a sufficient number of individuals presented the studied trait. Because of intrinsic limitations of linkage 
analysis, which will be further detailed below, and because these studies necessitate the recruitment of adequate family 
pedigrees, they are not well suited to the genetic analysis of all traits, particularly those for which only sporadic cases 
are available (e.g. drug response traits), or those which have a low penetrance within the studied population. 

Association studies offer an alternative to linkage analysis. Combined with the use of a high density map of 
appropriately spaced, sufficiently informative markers, association studies, including linkage disequilibrium-based 
genome wide association studies, will enable the identification of most genes involved in complex traits. 

The present invention relates to a method for generating a high density linkage disequilibrium-based genetic 
map of the human genome which will allow the identification of sufficiently informative markers spaced at intervals 
which permit their use in identifying genes responsible for detectable traits using genome-wide association studies and 
linkage disequilibrium mapping. 

Construction of a Physical Map 
The first step in constructing a high density genetic map of biallelic markers is the construction of a physical 
map. Physical maps consist of ordered, overlapping cloned fragments of genomic DNA covering a portion of the genome, 
preferably covering one or all chromosomes. Obtaining a physical map of the genome entails constructing and ordering a 
genomic DNA library. 

Physical mapping in complex genomes such as the human genome (3,000 Megabases) requires the construction 
of DNA libraries containing large inserts (on the order of 0.1 to 1 Megabase). It is crucial that such libraries be easy to 
construct, screen and manipulate, and that the DNA inserts be stable and relatively free of chimerism. 

Yeast artificial chromosomes (YACs; Burke et al., Science 236:806-812 (1987), the disclosure of which is 
incorporated herein by reference) have provided an invaluable tool in the analysis of complex genomes since their cloning 
capacity is extremely high (in the Mb range). YAC libraries containing large DNA inserts (up to 2 Mb) have been used to 
generate STS-content maps of individual chromosomes or of the entire human genome (Chumakov et al. (1995), supra; 
Hudson et al. (1995), supra; Cohen et al., Nature 366:698-701 (1993); Chumakov et al., Nature 359:380-387 (1992); 
Gemmill et al., Nature 377:299-319 (1995); Doggett et al., Nature 377:335-365 (1995); the disclosures of which are 
incorporated herein by reference). 

The present genetic maps may be constructed using currently available YAC genomic libraries such as the 
CEPH human YAC library as a starting material (Chumakov et al. (1995), supra). Alternatively, one may construct a 
YAC genomic library as described in Chumakov et al., 1995, the disclosure of which is incorporated herein by reference, 
or as described below. 

Once a YAC genomic library has been obtained, the genomic DNA fragments therein are ordered. Ordering may 
be performed directly on the genomic DNA in the YAC library. However, direct ordering of YAC inserts is not preferred 
because YAC libraries often exhibit a high rate of chimerism (40 to 50% of YAC clones contain fragments from more 
than one genomic region), often suffer from clonal instability within their genomic DNA inserts, and require tedious 
procedures to manipulate and isolate the insert DNA. Instead, it is preferable to conduct the mapping and sequencing 
procedures required for ordering the genomic DNA in a system which enables the stable cloning of large inserts while 
being easy to manipulate using standard molecular biology techniques. 

Accordingly, it is preferable to clone the genomic DNA into bacterial single copy plasmids, for example BACs 
(Bacterial Artificial Chromosomes), rather than into YACs. Bacterial artificial chromosomes are well suited for use in 
ordering genomic DNA fragments. BACs provide a low rate of chimerism and fragment rearrangement, together with 
relative ease of insert isolation. Thus BAC libraries are well suited to integrate genetic, STS and cytogenetic 
information while providing direct access to stable, readily-sequenceable genomic DNA. An example of a bacterial artificial 
chromosome is the BAC cloning system of Shizuya et al., which is capable of stably propagating and maintaining 
relatively large genomic DNA fragments (up to 300 kb long) as single-copy plasmids in E. coli (Shizuya et al., Proc. Natl. 
Acad. Sci. USA 89:8794-8797 (1992), the disclosure of which is incorporated herein by reference). 

Example 1 describes the construction of a BAC library containing human genomic DNA. It will be appreciated 
that the source of the genomic DNA, the enzymes used to digest the DNA, the vectors into which the genomic DNA is 
inserted, and the size of the DNA inserts which are cloned into said vectors need not be identical to those described in 
Example 1 below. Rather, the genomic DNA may be obtained from any appropriate source, may be digested with any 
appropriate enzyme, and may be cloned into any suitable vector. Insert size may vary within any range compatible with 
the cloning system chosen and with the intended purpose of the library being constructed. Typically, using BAC vectors 
to construct DNA libraries covering the entire human genome, insert size may vary between 50 kb and 300 kb, preferably 
100 kb and 200 kb. 

Example 1 
Construction of a BAC library 
Three different human genomic DNA libraries were produced by cloning partially digested DNA from a human 
lymphoblastoid cell line (derived from individual N° 8445, CEPH families) into the pBeloBAC11 vector (Kim et al., 
Genomics 34:213-218 (1996), the disclosure of which is incorporated herein by reference). One library was produced 
using a BamHI partial digestion of the genomic DNA from the lymphoblastoid cell line and contains 110,000 clones 
having an average insert size of 150 kb (corresponding to 5 human haploid genome equivalents). Another library was 
prepared from a HindIII partial digest and corresponds to 3 human genome equivalents with an average insert size of 
150 kb. A third library was prepared from a NdeI partial digest and corresponds to 4 human genome equivalents with an 
average insert size of 1 BOkb. 
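
The relationship between clone number, insert size and genome equivalents used above can be sketched as follows, assuming a haploid genome of 3 x 10^9 bp. The function names are illustrative. The clone count computed for a 3-equivalent library is derived from the stated insert size and is not a figure reported in the text.

```python
import math

HAPLOID_GENOME_KB = 3_000_000  # 3 x 10^9 bp, the genome size used in the text


def genome_equivalents(n_clones, mean_insert_kb, genome_kb=HAPLOID_GENOME_KB):
    """Total cloned DNA expressed as multiples of the haploid genome."""
    return n_clones * mean_insert_kb / genome_kb


def clones_needed(equivalents, mean_insert_kb, genome_kb=HAPLOID_GENOME_KB):
    """Number of clones required to reach a given number of genome equivalents."""
    return math.ceil(equivalents * genome_kb / mean_insert_kb)


print(genome_equivalents(110_000, 150))  # BamHI library described above
print(clones_needed(3, 150))             # derived size of a 3-equivalent library
```

For the BamHI library the raw product is 5.5 genome equivalents, which the text reports as 5.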

Alternatively, the genomic DNA may be inserted into BAC vectors which possess both a high copy number 
origin of replication, which facilitates the isolation of the vector DNA, and a low copy number origin of replication. 
Cloning of a genomic DNA insert into the high copy number origin of replication inactivates the origin such that clones 
containing a genomic insert replicate at low copy number. The low copy number of clones having a genomic insert 
therein permits the inserts to be stably maintained. In addition, selection procedures may be designed which enable low 
copy number plasmids (i.e. vectors having genomic inserts therein) to be selected. Such vectors and selection procedures 
are described in the U.S. Patent Application entitled "High Throughput DNA Sequencing Vector" (GENSET.015A, Serial 
No. 09/058,^16), the disclosure of which is incorporated herein by reference. 

It will be appreciated that the present methods may be practiced using BAC vectors other than those of 
Shizuya et al. (1992, supra), or derived from those, or vectors other than BAC vectors which possess the above- 
described characteristics. 

To construct a physical map of the genome from genomic DNA libraries, the library clones have to be ordered 
along the human chromosomes. In a preferred embodiment, a minimal subset of the ordered clones will then be chosen 
that completely covers the entire genome. 

For example, the genomic DNA in the inserts of the above described BAC vectors is ordered using STS markers whose 
positions relative to one another and locations along the genome are known using procedures such as those described 
herein. The STS markers used to order the BAC inserts may be the STS markers contained in the integrated maps 
described above. Alternatively, the STSs may be STSs which are not contained in any of the physical maps described 
above. In another embodiment, the STSs may be a combination of STSs included in the physical maps described above 
and STSs which are not included in the integrated maps described above. 

The BAC vectors are screened with STSs until there is at least one positive BAC clone per STS. Preferably, a 
minimally overlapping set of 10,000 to 30,000 BACs having genomic inserts spanning the entire human genome is 
identified. More preferably, a minimally overlapping set of 10,000 to 30,000 BACs having genomic inserts of about 
100-300 kb in length spanning the entire human genome is identified. In a preferred embodiment, a minimally overlapping set 
of 10,000 to 30,000 BACs having genomic inserts of about 100-150 kb in length spanning the entire human genome is 
identified. In a highly preferred embodiment, a minimally overlapping set of 15,000 to 25,000 BACs having genomic 
inserts of about 100-200 kb in length spanning the entire human genome is identified. Alternatively, a smaller number of 
BACs spanning a set of chromosomes, a single chromosome, a particular subchromosomal region, or any other desired 
portion of the genome may be ordered. The BACs may be screened for the presence of STSs as described in Example 2 
below. 

Example 2 
Ordering of a BAC Library: Screening Clones with STSs 
The BAC library is screened with a set of PCR-typeable STSs to identify clones containing the STSs. To 
facilitate PCR screening of several thousand clones, for example 200,000 clones, pools of clones are prepared. 

Three-dimensional pools of the BAC libraries are prepared as described in Chumakov et al. and are screened for
the ability to generate an amplification fragment in amplification reactions conducted using primers derived from the
ordered STSs (Chumakov et al. (1995), supra). A BAC library typically contains 200,000 BAC clones. Since the average
size of each insert is 100-300 kb, the overall size of such a library is equivalent to the size of at least about 7 human
genomes. This library is stored as an array of individual clones in 518 384-well plates. It can be divided into 74 primary
pools (7 plates each). Each primary pool can then be divided into 48 subpools prepared by using a three-dimensional
pooling system based on the plate, row and column address of each clone (more particularly, 7 subpools consisting of all
clones residing in a given microtiter plate; 16 subpools consisting of all clones in a given row; 24 subpools consisting of
all clones in a given column).
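The plate, row and column addressing described above can be illustrated with a minimal Python sketch. It assumes a 384-well plate laid out as 16 rows by 24 columns, clones numbered consecutively from zero, and a single positive clone per primary pool; the function names are hypothetical and are not part of the described procedure.

```python
PLATES_PER_POOL = 7   # plates per primary pool, as stated in the text
ROWS = 16             # assumed row count of a 384-well plate
COLS = 24             # assumed column count of a 384-well plate


def clone_address(clone_index):
    """Map a 0-based clone index to (primary pool, plate in pool, row, column)."""
    plate, well = divmod(clone_index, ROWS * COLS)
    pool, plate_in_pool = divmod(plate, PLATES_PER_POOL)
    row, col = divmod(well, COLS)
    return pool, plate_in_pool, row, col


def decode(pool, positive_plate, positive_row, positive_col):
    """Recover the clone index from one positive plate, row and column subpool."""
    plate = pool * PLATES_PER_POOL + positive_plate
    return plate * ROWS * COLS + positive_row * COLS + positive_col


if __name__ == "__main__":
    total_clones = 518 * ROWS * COLS
    print(total_clones)                     # number of wells in 518 plates
    print(clone_address(total_clones - 1))  # address of the last clone
```

If two clones in the same primary pool are positive, the plate, row and column signals no longer identify a unique well, which is one reason a direct confirmation on the candidate clone remains useful.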

Amplification reactions are conducted on the pooled BAC clones using primers specific for the STSs. For
example, the three dimensional pools may be screened with 45,000 STSs whose positions relative to one another and
locations along the genome are known. Preferably, the three dimensional pools are screened with about 30,000 STSs
whose positions relative to one another and locations along the genome are known. In a highly preferred embodiment,
the three dimensional pools are screened with about 20,000 STSs whose positions relative to one another and locations
along the genome are known.

Amplification products resulting from the amplification reactions are detected by conventional agarose gel
electrophoresis combined with automatic image capturing and processing. PCR screening for an STS involves three
steps: (1) identifying the positive primary pools; (2) for each positive primary pool, identifying the positive plate, row and
column 'subpools' to obtain the address of the positive clone; (3) directly confirming the PCR assay on the identified
clone. PCR assays are performed with primers specifically defining the STS.

Screening is conducted as follows. First, BAC DNA containing the genomic inserts is prepared as follows.
Bacteria containing the BACs are grown overnight at 37°C in 120 µl of LB containing chloramphenicol (12 µg/ml). DNA
is extracted by the following protocol:

Centrifuge 10 min at 4°C and 2000 rpm
Eliminate supernatant and resuspend pellet in 120 µl TE 10-2 (Tris HCl 10 mM, EDTA 2 mM)
Centrifuge 10 min at 4°C and 2000 rpm
Eliminate supernatant and incubate pellet with 20 µl lysozyme 1 mg/ml for 15 min at room temperature
Add 20 µl proteinase K 100 µg/ml and incubate 15 min at 60°C
Add 8 µl DNAse 2 U/µl and incubate 1 hr at room temperature
Add 100 µl TE 10-2 and keep at -80°C
PCR assays are performed using the following protocol:

Final volume
BAC DNA
MgCl2
dNTP (each)
primer (each)
AmpliTaq Gold DNA polymerase
PCR buffer (10x = 0.1 M Tris HCl pH 6.3, 0.5 M KCl)

The amplification is performed on a Genius II thermocycler. After heating at 95°C for 10 min, 40 cycles are performed.
Each cycle comprises: 30 sec at 95°C, 54°C for 1 min, and 30 sec at 72°C. For final elongation, 10 min at 72°C end
the amplification. PCR products are analyzed on 1% agarose gel with 0.1 mg/ml ethidium bromide.

Alternatively, a YAC (Yeast Artificial Chromosome) library can be used. The very large insert size, of the order
of 1 megabase, is the main advantage of the YAC libraries. The library can typically include about 33,000 YAC clones as
described in Chumakov et al. (1995), supra. The YAC screening protocol may be the same as the one used for BAC
screening.

The known order of the STSs is then used to align the BAC inserts in an ordered array (contig) spanning the
whole human genome. If necessary, new STSs to be tested can be generated by sequencing the ends of selected BAC
inserts. Subchromosomal localization of the BACs can be established and/or verified by fluorescence in situ hybridization
(FISH), performed on metaphasic chromosomes as described by Cherif et al. 1990 and in Example 8 below. BAC insert
size may be determined by Pulsed Field Gel Electrophoresis after digestion with the restriction enzyme NotI.

Finally, a minimally overlapping set of BAC clones, with known insert size and subchromosomal location,
covering the entire genome, a set of chromosomes, a single chromosome, a particular subchromosomal region, or any
other desired portion of the genome is selected from the DNA library. For example, the BAC clones may cover at least
100kb of contiguous genomic DNA, at least 250kb of contiguous genomic DNA, at least 500kb of contiguous genomic
DNA, at least 2Mb of contiguous genomic DNA, at least 5Mb of contiguous genomic DNA, at least 10Mb of contiguous
genomic DNA, or at least 20Mb of contiguous genomic DNA.

Identification of biallelic markers
In order to generate polymorphisms having the adequate informative content to be used as biallelic markers for
genetic mapping, the sequences of random genomic fragments from an appropriate number of unrelated individuals are
compared. Genomic sequences to be screened for biallelic markers may be generated by partially sequencing BAC
inserts, preferably by sequencing the ends of BAC subclones. Sequencing the ends of an adequate number of BAC
subclones derived from a minimally overlapping array of BACs such as those described above will allow the generation of
biallelic markers spanning the entire genome, a set of chromosomes, a single chromosome, a particular subchromosomal
region, or any other desired portion of the genome with an optimized inter-marker spacing.
Thus, portions of the BACs in the selected ordered array are then subcloned and sequenced using, for example,
the procedures described below.

Example 3
Subcloning of BACs

The cells obtained from three liters overnight culture of each BAC clone are treated by alkaline lysis using
conventional techniques to obtain the BAC DNA containing the genomic DNA inserts. After centrifugation of the BAC
DNA in a cesium chloride gradient, ca. 50 µg of BAC DNA are purified. 5-10 µg of BAC DNA are sonicated using three
distinct conditions, to obtain fragments within a desired size range. The obtained DNA fragments are end-repaired in a
50 µl volume with two units of Vent polymerase for 20 min at 70°C, in the presence of the four deoxytriphosphates
(100 µM). The resulting blunt-ended fragments are separated by electrophoresis on preparative low-melting point 1%
agarose gels (60 Volts for 3 hours). The fragments lying within a desired size range, such as 600 to 6,000 bp, are
excised from the gel and treated with agarase. After chloroform extraction and dialysis on Microcon 100 columns, DNA
in solution is adjusted to a 100 ng/µl concentration. A ligation to a linearised, dephosphorylated, blunt-ended plasmid
cloning vector is performed overnight by adding 100 ng of BAC fragmented DNA to 20 ng of pBluescript II SK (+) vector
DNA linearized by enzymatic digestion, and treating with alkaline phosphatase. The ligation reaction is performed in a
10 µl final volume in the presence of 40 units/µl T4 DNA ligase (Epicentre). The ligated products are electroporated into
the appropriate cells (ElectroMAX E. coli DH10B cells). IPTG and X-gal are added to the cell mixture, which is then
spread on the surface of an ampicillin-containing agar plate. After overnight incubation at 37°C, recombinant (white)
colonies are randomly picked and arrayed in 96 well microplates for storage and sequencing.

Alternately, BAC subcloning may be performed using vectors which possess both a high copy number origin
of replication, which facilitates the isolation of the vector DNA, and a low copy number origin of replication. Cloning of
a genomic DNA fragment into the high copy number origin of replication inactivates the origin such that clones
containing a genomic insert replicate at low copy number. The low copy number of clones having a genomic insert
therein permits the inserts to be stably maintained. In addition, selection procedures may be designed which enable low
copy number plasmids (i.e., vectors having genomic inserts therein) to be selected. In a preferred embodiment, BAC
subcloning will be performed in vectors having the above described features and moreover enabling high throughput
sequencing of long fragments of genomic DNA. Such high throughput high quality sequencing may be obtained after
generating successive deletions within the subcloned fragments to be sequenced, using transposition-based or enzymatic
systems. Such vectors are described in the U.S. Patent Application entitled "High Throughput DNA Sequencing Vector"
(GENSET.015A, Serial No. 09/058,746), the disclosure of which is incorporated herein by reference.

It will be appreciated that other subcloning methods familiar to those skilled in the art may also be employed. 
The resulting subclones are then partially sequenced using, for example, the procedures described below.
Example 4
Partial sequencing of BAC subclones
The genomic DNA inserts in the subclones, such as the BAC subclones prepared above, are amplified by 
conducting PCR reactions on the overnight bacterial cultures, using primers complementary to vector sequences flanking 
the insertions. 

The sequences of the insert extremities (on average 500 bases at each end, obtained under routine sequencing
conditions) are determined by fluorescent automated sequencing on ABI 377 sequencers, using ABI Prism DNA
Sequencing Analysis software. Following gel image analysis and DNA sequence extraction, sequence data are
automatically processed with adequate software to assess sequence quality. A proprietary base-caller automatically
flags suspect peaks, taking into account the shape of the peaks, the inter-peak resolution, and the noise level. The
proprietary base-caller also performs an automatic trimming. Any stretch of 25 or fewer bases having more than 4 suspect
peaks is usually considered unreliable and is discarded.
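The trimming rule stated above (a stretch of 25 or fewer bases with more than 4 suspect peaks is discarded) can be approximated with a sliding window. The following Python sketch is one possible reading of that rule and is not the proprietary base-caller: it assumes the suspect-peak flags are already available as a list of booleans, and it marks every base of any 25-base window that exceeds the threshold. The function name and parameters are hypothetical.

```python
def unreliable_mask(suspect, window=25, max_suspect=4):
    """Return a list of booleans, True where a base lies in an unreliable window.

    suspect: list of booleans, one per called base (True = suspect peak).
    A window is unreliable when it contains more than max_suspect suspect peaks.
    """
    n = len(suspect)
    mask = [False] * n
    # Slide a fixed-size window along the read; reads shorter than the
    # window are evaluated as a single window.
    for start in range(0, max(n - window, 0) + 1):
        end = min(start + window, n)
        if sum(suspect[start:end]) > max_suspect:
            for i in range(start, end):
                mask[i] = True
    return mask


if __name__ == "__main__":
    flags = [False] * 100
    for i in range(10, 15):   # five consecutive suspect peaks
        flags[i] = True
    mask = unreliable_mask(flags)
    print(sum(mask), "bases would be discarded")
```

A stricter variant could discard only the shortest stretch containing the suspect peaks rather than every overlapping window; the source does not say which behaviour the base-caller uses.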

The sequenced regions of the subclones, such as the BAC subclones prepared above, are then analyzed in
order to identify biallelic markers lying therein. The frequency at which biallelic markers will be detected in the
screening process varies with the average level of heterozygosity desired. For example, if biallelic markers having an
average heterozygosity rate of greater than 0.42 are desired, they will occur every 2.5 to 3 kb on average. Therefore,
on average, six 500 bp genomic fragments have to be screened in order to derive 1 biallelic marker having an adequate
informative content.

As a preferred alternative to sequencing the ends of an adequate number of BAC subclones, the above
mentioned high throughput deletion-based sequencing vectors, which allow the generation of a high quality sequence
information covering fragments of ca. 6kb, may be used. Having sequence fragments longer than 2.5 or 3kb enhances
the chances of identifying biallelic markers therein. Methods of constructing and sequencing a nested set of deletions
are disclosed in the U.S. Patent Application entitled "High Throughput DNA Sequencing Vector" (GENSET.015A, Serial
No. 09/058,746), the disclosure of which is incorporated herein by reference.

To identify biallelic markers using partial sequence information derived from subclone ends,
such as the ends of the BAC subclones prepared above, pairs of primers, each one specifically
defining a 500 bp amplification fragment, are designed using the above mentioned partial sequences.
The primers used for the genomic amplification of fragments derived from the subclones, such as
the BAC subclones prepared above, may be designed using the OSP software (Hillier L. and Green
P., Methods Appl. 1:124-8 (1991), the disclosure of which is incorporated herein by reference). The
GC content of the amplification primers preferably ranges between 10 and 75 %, more preferably
between 35 and 60 %, and most preferably between 40 and 55 %. The length of amplification
primers can range from 10 to 100 nucleotides, preferably from 10 to 50, 10 to 30 or more preferably
10 to 20 nucleotides. Shorter primers tend to lack specificity for a target nucleic acid sequence and
generally require cooler temperatures to form sufficiently stable hybrid complexes with the
template. Longer primers are expensive to produce and can sometimes self-hybridize to form hairpin
structures.
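The GC content and length ranges given above lend themselves to a simple screening check. The sketch below is illustrative only and is not the OSP software: it computes GC percentage for a candidate primer and reports which of the stated GC ranges it falls into. The function names and tier labels are hypothetical.

```python
def gc_percent(primer):
    """Percentage of G and C bases in a primer sequence."""
    primer = primer.upper()
    return 100.0 * (primer.count("G") + primer.count("C")) / len(primer)


def gc_tier(primer):
    """Classify a primer against the GC ranges stated in the text."""
    gc = gc_percent(primer)
    if 40 <= gc <= 55:
        return "most preferred"
    if 35 <= gc <= 60:
        return "more preferred"
    if 10 <= gc <= 75:
        return "preferred"
    return "outside stated range"


def length_in_range(primer, low=10, high=100):
    """True if the primer length lies within the broadest stated range."""
    return low <= len(primer) <= high


if __name__ == "__main__":
    candidate = "ATGCATGCATGCATGCATGC"  # made-up 20-mer for illustration
    print(gc_percent(candidate), gc_tier(candidate), length_in_range(candidate))
```

A practical primer design tool would also consider melting temperature, self-complementarity and 3' end stability, none of which are modelled here.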

All primers may contain, upstream of the specific target bases, a common oligonucleotide tail that serves as a
sequencing primer. Those skilled in the art are familiar with primer extensions which can be used for these purposes.

To identify biallelic markers, the sequences corresponding to the partial sequences determined above are
determined and compared in a plurality of individuals. The population used to identify biallelic markers having an
adequate informative content preferably consists of ca. 100 unrelated individuals from a heterogeneous population.

First, DNA is extracted from the peripheral venous blood of each donor using methods such as those described 
in Example 5. 

Example 5
Extraction of DNA

30 ml of blood are taken from the individuals in the presence of EDTA. Cells (pellet) are collected after
centrifugation for 10 minutes at 2000 rpm. Red cells are lysed by a lysis solution (50 ml final volume: 10 mM Tris
pH 7.6; 5 mM MgCl2; 10 mM NaCl). The solution is centrifuged (10 minutes, 2000 rpm) as many times as necessary to
eliminate the residual red cells present in the supernatant, after resuspension of the pellet in the lysis solution.

The pellet of white cells is lysed overnight at 42°C with 3.7 ml of lysis solution composed of:

• 3 ml TE 10-2 (Tris HCl 10 mM, EDTA 2 mM) / NaCl 0.4 M
• 200 µl SDS 10%
• 500 µl K-proteinase (2 mg K-proteinase in TE 10-2 / NaCl 0.4 M).

For the extraction of proteins, 1 ml saturated NaCl (6M) (1/3.5 v/v) is added. After vigorous agitation, the
solution is centrifuged for 20 minutes at 10000 rpm.

For the precipitation of DNA, 2 to 3 volumes of 100% ethanol are added to the previous supernatant, and the solution is
centrifuged for 30 minutes at 2000 rpm. The DNA solution is rinsed three times with 70% ethanol to eliminate salts,
and centrifuged for 20 minutes at 2000 rpm. The pellet is dried at 37°C, and resuspended in 1 ml TE 10-1 or 1 ml
water. The DNA concentration is evaluated by measuring the OD at 260 nm (1 unit OD = 50 µg/ml DNA).

To evaluate the presence of proteins in the DNA solution, the OD 260 / OD 280 ratio is determined. Only DNA
preparations having an OD 260 / OD 280 ratio between 1.8 and 2 are used in the subsequent steps described below.
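The two spectrophotometric checks above reduce to simple arithmetic, sketched here in Python. The conversion factor (1 OD unit at 260 nm = 50 µg/ml DNA) and the 1.8 to 2 acceptance window come from the text; the dilution factor parameter and the function names are assumptions added for illustration.

```python
UG_PER_ML_PER_OD260 = 50.0  # 1 OD unit at 260 nm corresponds to 50 µg/ml DNA


def dna_concentration_ug_per_ml(od260, dilution_factor=1.0):
    """Estimate DNA concentration from absorbance at 260 nm.

    dilution_factor is an assumed convenience parameter for samples that
    were diluted before being measured.
    """
    return od260 * UG_PER_ML_PER_OD260 * dilution_factor


def passes_purity_check(od260, od280, low=1.8, high=2.0):
    """True if the OD260/OD280 ratio lies within the accepted window."""
    return low <= od260 / od280 <= high


if __name__ == "__main__":
    print(dna_concentration_ug_per_ml(0.5, dilution_factor=20))  # µg/ml
    print(passes_purity_check(0.95, 0.5))
```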

Once genomic DNA from every individual in the given population has been extracted, it is preferred that a
fraction of each DNA sample is separated, after which a pool of DNA is constituted by assembling equivalent DNA
amounts of the separated fractions into a single one.

Second, the DNA obtained from peripheral blood as described above is amplified using the above mentioned
amplification primers.

Example 6 provides procedures that may be used in the amplification reactions, and the detection of 
polymorphisms within the obtained amplicons. 
Example 6

Amplification of DNA from Peripheral Blood
and Identification of Biallelic Markers
The amplification of each sequence is performed on pooled DNA samples obtained as in Example 5 above, using
PCR (Polymerase Chain Reaction) as follows:

• final volume 25 µl
• genomic DNA 2 ng/µl
• MgCl2 2 mM
• dNTP (each) 200 µM
• primer (each) 2.9 ng/µl
• AmpliTaq Gold DNA polymerase (Perkin) 0.05 unit/µl
• PCR buffer (10X = 0.1 M Tris HCl pH 9.3, 0.5 M KCl) 1X.
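The final concentrations listed above can be converted into per-reaction amounts for the 25 µl volume. The short Python sketch below does only that unit arithmetic; the dictionary keys and function name are hypothetical, and stock concentrations (which the text does not give) are deliberately not assumed.

```python
FINAL_VOLUME_UL = 25.0  # final reaction volume stated in the text


def per_reaction_amounts(volume_ul=FINAL_VOLUME_UL):
    """Total amount of each component in one reaction of the given volume."""
    return {
        "genomic_dna_ng": 2.0 * volume_ul,     # 2 ng/µl
        "primer_each_ng": 2.9 * volume_ul,     # 2.9 ng/µl
        "polymerase_units": 0.05 * volume_ul,  # 0.05 unit/µl
        "buffer_10x_ul": volume_ul / 10.0,     # 10X stock diluted to 1X
        "mgcl2_nmol": 2.0 * volume_ul,         # 2 mM equals 2 nmol/µl
        "dntp_each_nmol": 0.2 * volume_ul,     # 200 µM equals 0.2 nmol/µl
    }


if __name__ == "__main__":
    for name, amount in per_reaction_amounts().items():
        print(f"{name}: {amount:g}")
```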

The synthesis of primers is performed following the phosphoramidite method, on a 
GENSET UFPS 24.1 synthesizer. 

To reduce the expense of preparing amplification primers for use in the above procedures, short primers may be
used. While primers and probes having between 15 and 20 (or more) nucleotides are usually highly specific to a given
nucleic acid sequence, it may be inconvenient and expensive to synthesize a relatively long oligonucleotide for each
analysis. In order to at least partially circumvent this problem, it is often possible to use shorter but still relatively
specific oligonucleotides to create a manageable library. For example, a library of
oligonucleotides comprising about 8 to 10 nucleotides is conceivable and has already been used for sequencing of a
40,000 bp cosmid DNA (Studier, Proc. Natl. Acad. Sci. USA 86(18):6917-6921 (1989), the disclosure of which is
incorporated herein by reference).

Another potential way to obtain specific primers and probes with a small library of oligonucleotides is to
generate longer, more specific primers and probes from combinations of shorter, less specific oligonucleotides. Libraries
of shorter oligonucleotides, each one being from about five to eight nucleotides in length, have already been used
(Kieleczawa et al., Science 258:1767-1791 (1992); Kotler et al., Proc. Natl. Acad. Sci. USA 90:4241-4245 (1993);
Kaczorowski and Szybalski, Anal. Biochem. 221:127-135 (1994), the disclosures of which are incorporated herein by
reference). Suitable probes and primers of appropriate length can therefore be designed through the association of two
or three shorter oligonucleotides to constitute modular primers. The association between primers can be either covalent,
resulting from the activity of T4 DNA ligase, or non-covalent, through base-stacking energy.

The amplification is performed on a Perkin Elmer 9600 Thermocycler or MJ Research PTC200 with heating lid.
After heating at 95°C for 10 minutes, 40 cycles are performed. Each cycle comprises: 30 sec at 95°C, 1 minute at
54°C, and 30 sec at 72°C. For final elongation, 10 minutes at 72°C ends the amplification.

The quantities of the amplification products obtained are determined on 96-well microtiter plates, using a
fluorimeter and Picogreen as intercalating agent (Molecular Probes).
The sequences of the amplification products are determined using automated dideoxy terminator sequencing
reactions with a dye-primer cycle sequencing protocol. The products of the sequencing reactions are run on sequencing
gels and the sequences are determined using gel image analysis.

The sequence data are evaluated using software designed to detect the presence of biallelic sites among the
pooled amplified fragments. The polymorphism search is based on the presence of superimposed peaks in the
electrophoresis pattern resulting from different bases occurring at the same position. Because each dideoxy terminator
is labeled with a different fluorescent molecule, the two peaks corresponding to a biallelic site present distinct colors
corresponding to two different nucleotides at the same position on the sequence. The software evaluates the intensity
ratio between the two peaks and the intensity ratio between a given peak and surrounding peaks of the same color.

However, the presence of two peaks can be an artifact due to background noise. To exclude such an artifact,
the two DNA strands are sequenced and a comparison between the peaks is carried out. In order to be registered as a
polymorphic sequence, the polymorphism has to be detected on both strands.

The above procedure permits those amplification products which contain biallelic markers to be identified.
The detection limit for the frequency of biallelic polymorphisms detected by sequencing pools of 100
individuals is about 10% for the minor allele, as verified by sequencing pools of known allelic frequencies. However,
more than 90% of the biallelic polymorphisms detected by the pooling method have a frequency for the minor allele
higher than 25%. Therefore, the biallelic markers selected by this method have a frequency of at least 10% for the minor
allele and 90% or less for the major allele, preferably at least 20% for the minor allele and 80% or less for the major
allele, more preferably at least 30% for the minor allele and 70% or less for the major allele, thus a heterozygosity rate
higher than 0.18, preferably higher than 0.32, more preferably higher than 0.42.
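The heterozygosity thresholds quoted above follow from the allele frequencies if the heterozygosity rate of a biallelic marker is taken as 2p(1 - p), where p is the minor allele frequency. That formula is an assumption here (the text gives only the resulting numbers), but it reproduces each of them, as the following Python sketch shows. The function name is hypothetical.

```python
def heterozygosity(minor_allele_frequency):
    """Expected heterozygosity of a biallelic marker, assumed to be 2p(1 - p)."""
    p = minor_allele_frequency
    return 2.0 * p * (1.0 - p)


if __name__ == "__main__":
    for p in (0.10, 0.20, 0.25, 0.30):
        print(f"minor allele frequency {p:.2f}: heterozygosity {heterozygosity(p):.3f}")
```

Under this assumption, minor allele frequencies of 10%, 20% and 30% correspond to heterozygosity rates of 0.18, 0.32 and 0.42 respectively.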

In an initial study to determine the frequency of biallelic markers in the human genome that can be obtained
using the above methods, the following results were obtained. 300 different amplicons derived from 100 individuals, and
covering a total of 150 kb obtained from different genomic regions, were sequenced. A total of 54 biallelic
polymorphisms were identified, indicating that there is one biallelic polymorphism with a heterozygosity rate higher than
0.18 (frequency of the minor allele higher than 10%), preferably higher than 0.38 (frequency of the minor allele higher
than 25%), every 2.5 to 3 kb. Given that the human genome is about 3 x 10^6 kb long, this indicates that, out of the 10^7
biallelic markers present on the human genome, approximately 10^6 have adequate heterozygosity rates for genetic
mapping purposes.
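The extrapolation above is a direct proportion, worked through in the Python sketch below using only the figures given in the text (150 kb surveyed, 54 polymorphisms, a genome of about 3 x 10^6 kb). The function names are hypothetical.

```python
def average_spacing_kb(surveyed_kb, markers_found):
    """Average distance between detected markers in the surveyed sequence."""
    return surveyed_kb / markers_found


def genome_wide_estimate(genome_kb, spacing_kb):
    """Expected number of such markers if the same density holds genome-wide."""
    return genome_kb / spacing_kb


if __name__ == "__main__":
    spacing = average_spacing_kb(150.0, 54)
    total = genome_wide_estimate(3e6, spacing)
    print(f"one marker every {spacing:.2f} kb")
    print(f"about {total:.2e} markers genome-wide")
```

The result is of the order of 10^6 markers, which matches the figure stated in the text; the estimate assumes that the surveyed regions are representative of the genome as a whole.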

Using the procedures of Examples 1-6, sets containing increasing numbers of biallelic markers may be
constructed. For example, the procedures of Examples 1-6 are used to identify 1 to about 50 biallelic markers. In some
embodiments, the procedures of Examples 1-6 are used to identify about 50 to about 200 biallelic markers. In other
embodiments, the procedures of Examples 1-6 are used to identify about 200 to about 500 biallelic markers. In some
embodiments, the procedures of Examples 1-6 are used to identify about 1,000 biallelic markers. In other embodiments,
the procedures of Examples 1-6 are used to identify about 3,000 biallelic markers. In further embodiments, the
procedures of Examples 1-6 are used to identify about 5,000 biallelic markers. In another embodiment, the procedures
of Examples 1-6 are used to identify about 10,000 biallelic markers. In still another embodiment, the procedures of
Examples 1-6 are used to identify about 20,000 biallelic markers. In still another embodiment, the procedures of
Examples 1-6 are used to identify about 40,000 biallelic markers. In still another embodiment, the procedures of
Examples 1-6 are used to identify about 60,000 biallelic markers. In still another embodiment, the procedures of
Examples 1-6 are used to identify about 80,000 biallelic markers. In still another embodiment, the procedures of
Examples 1-6 are used to identify more than 100,000 biallelic markers. In a further embodiment, the procedures of
Examples 1-6 are used to identify more than 120,000 biallelic markers.

As discussed above, the ordered nucleic acids, such as the inserts in BAC clones, which contain the biallelic
markers of the present invention may span a portion of the genome. For example, the ordered nucleic acids may span at
least 100kb of contiguous genomic DNA, at least 250kb of contiguous genomic DNA, at least 500kb of contiguous
genomic DNA, at least 2Mb of contiguous genomic DNA, at least 5Mb of contiguous genomic DNA, at least 10Mb of
contiguous genomic DNA, or at least 20Mb of contiguous genomic DNA.

In addition, groups of biallelic markers located in proximity to one another along the genome may be identified
within these portions of the genome for use in haplotyping analyses as described below. The biallelic markers included
in each of these groups may be located within a genomic region spanning less than 1kb, from 1 to 5kb, from 5 to 10kb,
from 10 to 25kb, from 25 to 50kb, from 50 to 150kb, from 150 to 250kb, from 250 to 500kb, from 500kb to 1Mb, or
more than 1Mb. It will be appreciated that the ordered DNA fragments containing these groups of biallelic markers need not
completely cover the genomic regions of those lengths but may instead be incomplete contigs having one or more gaps
therein. As discussed in further detail below, biallelic markers may be used in single marker and haplotype association
analyses regardless of the completeness of the corresponding physical contig harboring them.

Using the procedures above, 653 biallelic markers, each having two alleles, were identified using sequences
obtained from BACs which had been localized on the genome. In some cases, markers were identified using pooled BACs
and thereafter reassigned to individual BACs using STS screening procedures such as those described in Examples 2 and
7. The sequences of 50 of these 653 biallelic markers are provided in the accompanying Sequence Listing as SEQ ID
Nos. 1-50 and 51-100 (with SEQ ID Nos. 1-50 being one allele of these 50 biallelic markers and SEQ ID Nos. 51-100
being the other allele of these 50 biallelic markers). Although the sequences of SEQ ID Nos. 1-50 and 51-100 will be
used as exemplary markers throughout the present application, it will be appreciated that the biallelic markers used in
the maps of the present invention are not limited to these particular markers, nor are they limited to having the exact
flanking sequences surrounding the polymorphic bases which are enumerated in SEQ ID Nos. 1-50 and 51-100. Rather,
it will be appreciated that the flanking sequences surrounding the polymorphic bases of SEQ ID Nos. 1-50 and 51-100
may be lengthened or shortened to any extent compatible with their intended use and the present invention specifically
contemplates such sequences. The sequences of these 653 biallelic markers, including the sequences of SEQ ID Nos. 1-50
and 51-100, may be used to construct the maps of the present invention as well as in the gene identification and
diagnostic techniques described herein. It will be appreciated that the biallelic markers referred to herein may be of any
length compatible with their intended use provided that the markers include the polymorphic base, and the present
invention specifically contemplates such sequences.
Ordering of biallelic markers
Biallelic markers can be ordered to determine their positions along chromosomes, preferably subchromosomal 
regions, most preferably along the above described minimally overlapping ordered BAC arrays, as follows. 

The positions of the biallelic markers along chromosomes may be determined using a variety of methodologies.
In one approach, radiation hybrid mapping is used. Radiation hybrid (RH) mapping is a somatic cell genetic approach that
can be used for high resolution mapping of the human genome. In this approach, cell lines containing one or more human
chromosomes are lethally irradiated, breaking each chromosome into fragments whose size depends on the radiation dose.
These fragments are rescued by fusion with cultured rodent cells, yielding subclones containing different portions of the
human genome. This technique is described by Benham et al. (Genomics 4:509-517, 1989) and Cox et al. (Science
250:245-250, 1990), the entire contents of which are hereby incorporated by reference. The random and independent
nature of the subclones permits efficient mapping of any human genome marker. Human DNA isolated from a panel of 80-
100 cell lines provides a mapping reagent for ordering biallelic markers. In this approach, the frequency of breakage
between markers is used to measure distance, allowing construction of fine resolution maps as has been done for ESTs
(Schuler et al., Science 274:540-546, 1996, hereby incorporated by reference).
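To make the idea of measuring distance by breakage frequency concrete, the sketch below uses a commonly cited two-point estimator from the radiation hybrid literature. The text does not specify an estimator, so the formula, the function name and the input format are all assumptions: the breakage frequency theta is estimated as the number of hybrids retaining exactly one of the two markers divided by T(R_A + R_B - 2 R_A R_B), where T is the number of hybrids and R_A and R_B are the retention frequencies, and the map distance is taken as -ln(1 - theta).

```python
import math


def rh_two_point(marker_a, marker_b):
    """Estimate breakage frequency and map distance between two markers.

    marker_a, marker_b: equal-length sequences of 0/1 retention calls,
    one entry per hybrid cell line in the panel.
    Returns (theta, distance); distance is infinite when theta reaches 1.
    """
    if len(marker_a) != len(marker_b) or not marker_a:
        raise ValueError("markers must be typed on the same non-empty panel")
    n = len(marker_a)
    r_a = sum(marker_a) / n
    r_b = sum(marker_b) / n
    denom = n * (r_a + r_b - 2.0 * r_a * r_b)
    if denom == 0:
        raise ValueError("retention frequencies give no breakage information")
    discordant = sum(a != b for a, b in zip(marker_a, marker_b))
    theta = min(discordant / denom, 1.0)
    distance = math.inf if theta >= 1.0 else -math.log(1.0 - theta)
    return theta, distance


if __name__ == "__main__":
    a = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]  # made-up retention pattern
    b = [1, 1, 1, 1, 0, 1, 0, 0, 0, 0]  # differs from a in two hybrids
    print(rh_two_point(a, b))
```

Real panels of 80-100 cell lines are usually analysed with multipoint maximum likelihood methods; the two-point calculation is shown only because it makes the relationship between co-retention and distance explicit.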

RH mapping has been used to generate a high-resolution whole genome radiation hybrid map of human
chromosome 17q22-q25.3 across the genes for growth hormone (GH) and thymidine kinase (TK) (Foster et al., Genomics
33:185-192, 1995), the region surrounding the Gorlin syndrome gene (Obermayr et al., Eur. J. Hum. Genet. 4:242-245,
1996), 60 loci covering the entire short arm of chromosome 12 (Raeymaekers et al., Genomics 29:170-178, 1995), the
region of human chromosome 22 containing the neurofibromatosis type 2 locus (Frazer et al., Genomics 14:574-584, 1992)
and 13 loci on the long arm of chromosome 5 (Warrington et al., Genomics 11:701-703, 1991).

Alternatively, PCR based techniques and human-rodent somatic cell hybrids may be used to determine the
positions of the biallelic markers on the chromosomes. In such approaches, oligonucleotide primer pairs which are capable of
generating amplification products containing the polymorphic bases of the biallelic markers are designed. Preferably, the
oligonucleotide primers are 18-23 bp in length and are designed for PCR amplification. The creation of PCR primers from
known sequences is well known to those with skill in the art. For a review of PCR technology see Erlich, H.A., PCR
Technology: Principles and Applications for DNA Amplification, 1992, W.H. Freeman and Co., New York.

The primers are used in polymerase chain reactions (PCR) to amplify templates from total human genomic DNA.
PCR conditions are as follows: 60 ng of genomic DNA is used as a template for PCR with 80 ng of each oligonucleotide
primer, 0.6 unit of Taq polymerase, and 1 µCi of a 32P-labeled deoxycytidine triphosphate. The PCR is performed in a
microplate thermocycler (Techne) under the following conditions: 30 cycles of 94°C, 1.4 min; 55°C, 2 min; and 72°C, 2 min;
with a final extension at 72°C for 10 min. The amplified products are analyzed on a 6% polyacrylamide sequencing gel and
visualized by autoradiography. If the length of the resulting PCR product is identical to the length expected for an
amplification product containing the polymorphic base of the biallelic marker, then the PCR reaction is repeated with DNA
templates from two panels of human-rodent somatic cell hybrids, BIOS PCRable DNA (BIOS Corporation) and NIGMS
Human-Rodent Somatic Cell Hybrid Mapping Panel Number 1 (NIGMS, Camden, NJ).
PCR is used to screen a series of somatic cell hybrid cell lines containing defined sets of human chromosomes for
the presence of a given biallelic marker. DNA is isolated from the somatic hybrids and used as starting templates for PCR
reactions using the primer pairs from the biallelic marker. Only those somatic cell hybrids with chromosomes containing the
human sequence corresponding to the biallelic marker will yield an amplified fragment. The biallelic markers are assigned to
a chromosome by analysis of the segregation pattern of PCR products from the somatic hybrid DNA templates. The single
human chromosome present in all cell hybrids that give rise to an amplified fragment is the chromosome containing that
biallelic marker. For a review of techniques and analysis of results from somatic cell gene mapping experiments, see
Ledbetter et al., Genomics 6:475-481 (1990).
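The segregation analysis described above amounts to intersecting the chromosome contents of the PCR-positive hybrids. The Python sketch below illustrates this with made-up hybrid names and chromosome sets; the additional step of excluding chromosomes found in PCR-negative hybrids is an assumption that goes beyond what the text states, and the function name is hypothetical.

```python
def assign_chromosome(hybrid_chromosomes, pcr_positive):
    """Return the set of candidate chromosomes for a marker.

    hybrid_chromosomes: dict mapping hybrid name -> set of human chromosomes
        carried by that hybrid cell line.
    pcr_positive: dict mapping hybrid name -> True if an amplified fragment
        was obtained from that hybrid, False otherwise.
    """
    positives = [set(hybrid_chromosomes[h]) for h, ok in pcr_positive.items() if ok]
    negatives = [set(hybrid_chromosomes[h]) for h, ok in pcr_positive.items() if not ok]
    if not positives:
        return set()
    # Chromosomes present in every positive hybrid (as described in the text).
    candidates = set.intersection(*positives)
    # Assumed refinement: drop chromosomes also carried by a negative hybrid.
    for chroms in negatives:
        candidates -= chroms
    return candidates


if __name__ == "__main__":
    panel = {"H1": {5, 17}, "H2": {5, 17, 22}, "H3": {5, 9}}  # illustrative only
    results = {"H1": True, "H2": True, "H3": False}
    print(assign_chromosome(panel, results))
```

A unique assignment corresponds to a result set containing exactly one chromosome; an empty or multi-member set suggests typing errors or an insufficiently informative panel.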

Example 7 describes a preferred method for positioning of biallelic markers on clones, such as BAC clones,
obtained from genomic DNA libraries.

Example 7

Screening BAC libraries with biallelic markers
Amplification primers enabling the specific amplification of DNA fragments carrying the biallelic markers (including
the 653 biallelic markers obtained above, which include the sequences of SEQ ID Nos. 1-50 and 51-100) may be used to
screen clones in any genomic DNA library, preferably the BAC libraries described above, for the presence of the biallelic
markers.

Pairs of primers were designed which allowed the amplification of fragments carrying the 653 biallelic markers
obtained above. The amplification primers may be used to screen clones in a genomic DNA library for the presence of the
653 biallelic markers. For example, pairs of amplification primers of SEQ ID Nos. 101-150 and 151-200 may be used to
amplify fragments which include the polymorphic bases of the biallelic markers of SEQ ID Nos. 1-50 and 51-100.

It will be appreciated that amplification primers for the biallelic markers may be any sequences which allow the
specific amplification of any DNA fragment carrying the markers and may be designed using techniques familiar to those
skilled in the art. The amplification primers may be oligonucleotides of 8, 10, 15, 20 or more bases in length which
enable the amplification of any fragment carrying the polymorphic site in the markers. The polymorphic base may be in
the center of the amplification product or, alternatively, it may be located off-center. For example, in some
embodiments, the amplification product produced using these primers may be at least 100 bases in length (i.e. 50
nucleotides on each side of the polymorphic base in amplification products in which the polymorphic base is centrally
located). In other embodiments, the amplification product produced using these primers may be at least 500 bases in
length (i.e. 250 nucleotides on each side of the polymorphic base in amplification products in which the polymorphic base
is centrally located). In still further embodiments, the amplification product produced using these primers may be at
least 1000 bases in length (i.e. 500 nucleotides on each side of the polymorphic base in amplification products in which
the polymorphic base is centrally located). Amplification primers such as those described above are included within the
scope of the present invention.

The localization of biallelic markers on BAC clones is performed essentially as described in Example 2.
The BAC clones to be screened are distributed in three dimensional pools as described in Example 2.
Amplification reactions are conducted on the pooled BAC clones using primers specific for the biallelic markers
to identify BAC clones which contain the biallelic markers, using procedures essentially similar to those described in
Example 2.

Amplification products resulting from the amplification reactions are detected by conventional agarose gel
electrophoresis combined with automatic image capturing and processing. PCR screening for a biallelic marker involves
three steps: (1) identifying the positive primary pools; (2) for each positive primary pool, identifying the positive plate,
row and column 'subpools' to obtain the address of the positive clone; (3) directly confirming the PCR assay on the
identified clone. PCR assays are performed with primers defining the biallelic marker.
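
Step (2) of the pool screen, reading a clone address from the positive plate, row and column subpools, can be sketched as follows. This is only an illustration in Python: the function name, the dictionary layout and the 8 x 12 plate geometry used in the usage note are assumptions, not details given in this text.

```python
from typing import Dict, Hashable, Optional, Tuple


def locate_clone(
    plate_hits: Dict[Hashable, bool],
    row_hits: Dict[Hashable, bool],
    col_hits: Dict[Hashable, bool],
) -> Optional[Tuple[Hashable, Hashable, Hashable]]:
    """Return the (plate, row, column) address of the positive clone.

    Each argument maps a subpool label to True when its PCR gave a product.
    An address is returned only when exactly one plate, one row and one
    column subpool are positive; any other pattern is ambiguous and
    returns None (hypothetical rule, meaning the pool must be re-screened).
    """
    plates = [label for label, hit in plate_hits.items() if hit]
    rows = [label for label, hit in row_hits.items() if hit]
    cols = [label for label, hit in col_hits.items() if hit]
    if len(plates) == 1 and len(rows) == 1 and len(cols) == 1:
        return plates[0], rows[0], cols[0]
    return None
```

For instance, with rows labelled A to H and columns 1 to 12, a pattern in which only plate "P2", row "C" and column 7 are positive resolves to the single address ("P2", "C", 7), which is then confirmed by a direct PCR assay on that clone as in step (3).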

Screening is conducted as follows. First, BAC DNA is isolated. Bacteria containing the genomic
inserts are grown overnight at 37°C in 120 µl of LB containing chloramphenicol (12 µg/ml). DNA is extracted by the
following protocol:

Centrifuge 10 min at 4°C and 2000 rpm
Eliminate supernatant and resuspend pellet in 120 µl TE 10-2 (Tris HCl 10 mM, EDTA 2 mM)
Centrifuge 10 min at 4°C and 2000 rpm
Eliminate supernatant and incubate pellet with 20 µl lysozyme 1 mg/ml during 15 min at room temperature
Add 20 µl proteinase K 100 µg/ml and incubate 15 min at 60°C
Add 8 µl DNAse 2 U/µl and incubate 1 hr at room temperature
Add 100 µl TE 10-2 and keep at -80°C



PCR assays are performed using the following protocol:

Final volume                                           45 µl
BAC DNA                                                1.7 ng/µl
MgCl2                                                  2 mM
dNTP (each)                                            200 µM
primer (each)                                          2.9 ng/µl
AmpliTaq Gold DNA polymerase                           0.05 unit/µl
PCR buffer (10x = 0.1 M Tris-HCl pH 8.3, 0.5 M KCl)    1x



The amplification is performed on a Genius II thermocycler. After heating at 95°C for 10 min, 40 cycles are
performed. Each cycle comprises: 30 sec at 95°C, 54°C for 1 min, and 30 sec at 72°C. For final elongation, 10 min at
72°C end the amplification. PCR products are analyzed on 1% agarose gel with 0.1 mg/ml ethidium bromide.
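
As a quick check of the recipe and cycling program above, the per-reaction amounts and the programmed hold time can be worked out as follows. This is a sketch in Python: the function names are illustrative, the 45 µl volume is taken as printed, and ramping time between temperatures is ignored.

```python
def reaction_totals(final_volume_ul, per_ul):
    """Total amount of each component in one reaction.

    per_ul maps a component name to its final concentration expressed
    per microlitre (ng/µl or unit/µl), so total = concentration * volume.
    """
    return {name: conc * final_volume_ul for name, conc in per_ul.items()}


def program_minutes(initial_s, cycles, step_seconds, final_s):
    """Programmed hold time in minutes, excluding ramping between steps."""
    return (initial_s + cycles * sum(step_seconds) + final_s) / 60


# Values read from the protocol above.
totals = reaction_totals(
    45,
    {"BAC DNA (ng)": 1.7, "each primer (ng)": 2.9, "polymerase (units)": 0.05},
)
hold_time = program_minutes(initial_s=600, cycles=40, step_seconds=(30, 60, 30), final_s=600)
```

The molar components (MgCl2, dNTPs) are left out on purpose, since converting them to pipetting volumes would require stock concentrations that the text does not give.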

Using such procedures, a number of BAC clones carrying selected biallelic markers can be isolated. The
position of these BAC clones on the human genome can be defined by performing STS screening as described in Example
2. Preferably, to decrease the number of STSs to be tested, each BAC can be localized on chromosomal or
subchromosomal regions by procedures such as those described in Examples 8 and 9 below. This localization will allow
the selection of a subset of STSs corresponding to the identified chromosomal or subchromosomal region. Testing each
BAC with such a subset of STSs and taking account of the position and order of the STSs along the genome will allow a
refined positioning of the corresponding biallelic marker along the genome.

In other embodiments, if the DNA library used to isolate BAC inserts or any type of genomic DNA fragments
harboring the selected biallelic markers already constitutes a physical map of the genome or any portion thereof, using the
known order of the DNA fragments will allow the order of the biallelic markers to be established.

As discussed above, it will be appreciated that markers carried by the same fragment of genomic DNA, such as
the insert in a BAC clone, need not necessarily be ordered with respect to one another within the genomic fragment to
conduct single point or haplotype association analysis. However, in other embodiments of the present maps, the order of
biallelic markers carried by the same fragment of genomic DNA may be determined.

The positions of the biallelic markers used to construct the maps of the present invention, including the 653
biallelic markers obtained above, may be assigned to subchromosomal locations using Fluorescence In Situ Hybridization
(FISH) (Cherif et al., Proc. Natl. Acad. Sci. USA, 87:6639-6643 (1990), the disclosure of which is incorporated herein by
reference). FISH analysis is described in Example 8 below.



Example 8

Assignment of Biallelic Markers to Subchromosomal Regions
Metaphase chromosomes are prepared from phytohemagglutinin (PHA)-stimulated blood cell donors. PHA-stimulated
lymphocytes from healthy males are cultured for 72 h in RPMI-1640 medium. For synchronization, methotrexate
(10 µM) is added for 17 h, followed by addition of 5-bromodeoxyuridine (5-BudR, 0.1 mM) for 6 h. Colcemid (1 µg/ml) is
added for the last 15 min before harvesting the cells. Cells are collected, washed in RPMI, incubated with a hypotonic
solution of KCl (75 mM) at 37°C for 15 min and fixed in three changes of methanol:acetic acid (3:1). The cell suspension is
dropped onto a glass slide and air-dried.

BAC clones carrying the biallelic markers used to construct the maps of the present invention (including the 653
biallelic markers obtained above) can be isolated as described above. These BACs or portions thereof, including fragments
carrying said biallelic markers, obtained for example from amplification reactions using pairs of amplification primers as
described above, can be used as probes to be hybridized with metaphasic chromosomes. It will be appreciated that the
hybridization probes to be used in the contemplated method may be generated using alternative methods well known to
those skilled in the art. Hybridization probes may have any length suitable for this intended purpose.

Probes are then labeled with biotin-16 dUTP by nick translation according to the manufacturer's instructions
(Bethesda Research Laboratories, Bethesda, MD), purified using a Sephadex G-50 column (Pharmacia, Uppsala, Sweden) and
precipitated. Just prior to hybridization, the DNA pellet is dissolved in hybridization buffer (50% formamide, 2 X SSC, 10%
dextran sulfate, 1 mg/ml sonicated salmon sperm DNA, pH 7) and the probe is denatured at 70°C for 5-10 min.

Slides kept at -20°C are treated for 1 h at 37°C with RNase A (100 µg/ml), rinsed three times in 2 X SSC and
dehydrated in an ethanol series. Chromosome preparations are denatured in 70% formamide, 2 X SSC for 2 min at 70°C,
then dehydrated at 4°C. The slides are treated with proteinase K (10 µg/100 ml in 20 mM Tris-HCl, 2 mM CaCl2) at 37°C
for 8 min and dehydrated. The hybridization mixture containing the probe is placed on the slide, covered with a coverslip,
sealed with rubber cement and incubated overnight in a humid chamber at 37°C. After hybridization and post-hybridization
washes, the biotinylated probe is detected by avidin-FITC and amplified with additional layers of biotinylated goat anti-avidin
and avidin-FITC. For chromosomal localization, fluorescent R-bands are obtained as previously described (Cherif et al. (1990),
supra). The slides are observed under a LEICA fluorescence microscope (DMRXA). Chromosomes are counterstained with
propidium iodide and the fluorescent signal of the probe appears as two symmetrical yellow-green spots on both chromatids
of the fluorescent R-band chromosome (red). Thus, a particular biallelic marker may be localized to a particular cytogenetic
R-band on a given chromosome.

The above procedure was used to confirm the subchromosomal location of 25% of the BAC clones harboring the
653 markers obtained above. In particular, the 50 markers of SEQ ID Nos. 1-50 and 51-100 were assigned to
subchromosomal regions of chromosome 21. Simple identification numbers were attributed to each BAC from which the
markers are derived. Figure 1 is a cytogenetic map of chromosome 21 indicating the subchromosomal regions therein. Table
1 lists the internal identification number of the localized biallelic markers, the internal identification number of the BACs from
which the markers were derived, the size of the BAC insert, the average intermarker distance in the BAC insert and the
subchromosomal locations of the biallelic markers. The sequences of the localized markers are provided as SEQ ID Nos. 1-50
and 51-100 in the accompanying sequence listing. Amplification primers for generating amplification products containing
the polymorphic bases of these markers are also provided as SEQ ID Nos. 101-150 and 151-200 in the accompanying
sequence listing. Microsequencing primers for use in determining the identities of the polymorphic bases of these biallelic
markers are provided in the accompanying Sequence Listing as SEQ ID Nos. 201-250 and 251-300.

The rate at which biallelic markers may be assigned to subchromosomal regions may be enhanced through
automation. For example, probe preparation may be performed in a microtiter plate format, using adequate robots. The rate
at which biallelic markers may be assigned to subchromosomal regions may be enhanced using techniques which permit the
in situ hybridization of multiple probes on a single microscope slide, such as those disclosed in Larin et al., Nucleic Acids
Research 22: 3689-3692 (1994), the disclosure of which is incorporated herein by reference. In the largest test format
described, different probes were hybridized simultaneously by applying them directly from a 96-well microtiter dish which
was inverted on a glass plate. Software for image data acquisition and analysis that is adapted to each optical system, test
format and fluorescent probe used, can be derived from the system described in Lichter et al., Science 247: 64-69 (1990),
the disclosure of which is incorporated herein by reference. Such software measures the relative distance between the
center of the fluorescent spot corresponding to the hybridized probe and the telomeric end of the short arm of the
corresponding chromosome, as compared to the total length of the chromosome. The rate at which biallelic markers are
assigned to subchromosomal locations may be further enhanced by simultaneously applying probes labeled with different
fluorescent tags to each well of the 96 well dish. A further benefit of conducting the analysis on one slide is that it
facilitates automation, since a microscope having a moving stage and the capability of detecting fluorescent signals in
different metaphase chromosomes could provide the coordinates of each probe on the metaphase chromosomes distributed
on the 96 well dish.
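
The relative-distance measurement described above (probe signal position from the telomeric end of the short arm, expressed as a fraction of total chromosome length) can be sketched as follows. This is an illustration in Python, not the software cited above: it assumes a straight chromosome whose two telomeric ends have already been located in image coordinates, and the function name is hypothetical.

```python
def fractional_length_from_pter(spot, pter, qter):
    """Fractional position of a FISH signal along a chromosome.

    spot, pter and qter are (x, y) image coordinates of the signal center,
    the short-arm telomere and the long-arm telomere. The signal is
    projected onto the pter-qter axis, so 0.0 is the short-arm end and
    1.0 the long-arm end. Assumes a straight, unbent chromosome.
    """
    axis_x, axis_y = qter[0] - pter[0], qter[1] - pter[1]
    spot_x, spot_y = spot[0] - pter[0], spot[1] - pter[1]
    length_sq = axis_x * axis_x + axis_y * axis_y
    if length_sq == 0:
        raise ValueError("pter and qter must be distinct points")
    return (spot_x * axis_x + spot_y * axis_y) / length_sq
```

Averaging the two chromatid spots of several metaphases before converting the fraction to a cytogenetic band would be a natural extension, but that step depends on a band table not given here.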

Example 9 below describes an alternative method to position biallelic markers which allows their assignment to
human chromosomes.

Example 9

Assignment of Biallelic Markers to Human Chromosomes
The biallelic markers used to construct the maps of the present invention, including the 653 biallelic markers
obtained above (which include the sequences of SEQ ID Nos. 1-50 and 51-100), may be assigned to a human
chromosome using monosomal analysis as described below.

The chromosomal localization of a biallelic marker can be performed through the use of somatic cell hybrid
panels. For example 24 panels, each panel containing a different human chromosome, may be used (Russell et al.,
Somat. Cell Mol. Genet. 22:425-431 (1996); Drwinga et al., Genomics 16:311-314 (1993), the disclosures of which are
incorporated herein by reference).

The biallelic markers are localized as follows. The DNA of each somatic cell hybrid is extracted and purified.
Genomic DNA samples from a somatic cell hybrid panel are prepared as follows. Cells are lysed overnight at 42°C with
3.7 ml of lysis solution composed of:
3 ml TE 10-2 (Tris HCl 10 mM, EDTA 2 mM) / NaCl 0.4 M
200 µl SDS 10%
500 µl K-proteinase (2 mg K-proteinase in TE 10-2 / NaCl 0.4 M)

For the extraction of proteins, 1 ml saturated NaCl (6M) (1/3.5 v/v) is added. After vigorous agitation, the
solution is centrifuged for 20 min at 10,000 rpm. For the precipitation of DNA, 2 to 3 volumes of 100% ethanol are
added to the previous supernatant and the solution is centrifuged for 30 min at 2,000 rpm. The DNA solution is rinsed
three times with 70% ethanol to eliminate salts, and centrifuged for 20 min at 2,000 rpm. The pellet is dried at 37°C,
and resuspended in 1 ml TE 10-1 or 1 ml water. The DNA concentration is evaluated by measuring the OD at 260 nm (1
unit OD = 50 µg/ml DNA). To determine the presence of proteins in the DNA solution, the OD 260 / OD 280 ratio is
determined. Only DNA preparations having an OD 260 / OD 280 ratio between 1.8 and 2 are used in the PCR assay.
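
The two spectrophotometric checks above reduce to simple arithmetic, sketched here in Python. The conversion factor (1 OD unit = 50 µg/ml) and the 1.8 to 2 acceptance window come from the text; the function names and the optional dilution factor are assumptions added for illustration.

```python
def dna_concentration_ug_per_ml(od260, dilution_factor=1.0):
    """DNA concentration from absorbance at 260 nm (1 OD unit = 50 µg/ml).

    dilution_factor is a hypothetical convenience for samples that were
    diluted before reading; leave it at 1.0 for an undiluted sample.
    """
    return od260 * 50.0 * dilution_factor


def passes_purity_check(od260, od280, low=1.8, high=2.0):
    """True when the OD260/OD280 ratio lies within the accepted window."""
    return low <= od260 / od280 <= high
```

For example, a sample read at OD260 = 0.5 after a tenfold dilution corresponds to 250 µg/ml, and a preparation with OD260 = 0.95 and OD280 = 0.5 (ratio 1.9) would be accepted for the PCR assay.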

Then, a PCR assay is performed on genomic DNA with primers defining the biallelic marker. The PCR assay is
performed as described above for BAC screening. The PCR products are analyzed on a 1% agarose gel containing 0.2
mg/ml ethidium bromide.
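
Reading the panel result amounts to finding the single hybrid that yields a product. A minimal Python sketch follows, assuming a panel in which each hybrid carries exactly one human chromosome and that results are recorded as a mapping from chromosome label to a yes/no amplification call; the names are illustrative only.

```python
def assign_chromosome(panel_results):
    """Return the chromosome whose hybrid amplified the marker.

    panel_results maps a chromosome label ("1".."22", "X", "Y") to True
    when the PCR on the corresponding hybrid DNA gave a product.
    Returns None when no hybrid, or more than one hybrid, is positive,
    since the assignment is then not unambiguous.
    """
    positives = [chrom for chrom, amplified in panel_results.items() if amplified]
    return positives[0] if len(positives) == 1 else None


# Hypothetical panel of 24 hybrids in which only the chromosome 21 hybrid amplifies.
labels = [str(i) for i in range(1, 23)] + ["X", "Y"]
example_panel = {label: (label == "21") for label in labels}
example_call = assign_chromosome(example_panel)
```

In practice a rodent-only control and a total human DNA control would normally be read alongside the panel, which this sketch leaves out.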

The ordering analyses described above may be conducted to generate an integrated genome wide genetic map
comprising about 20,000 biallelic markers (1 biallelic marker per BAC if 20,000 BAC inserts are screened). In some
embodiments, the map includes one or more of the 653 markers obtained above (which include the sequences of SEQ ID
Nos. 1-50 and 51-100 or the sequences complementary thereto).

In another embodiment, the above procedures are conducted to generate a map comprising about 40,000
markers (an average of 2 biallelic markers per BAC if 20,000 BAC inserts are screened). In some embodiments, the map
includes one or more of the 653 markers obtained above (which include the sequences of SEQ ID Nos. 1-50 and 51-100
or the sequences complementary thereto).

In a further preferred embodiment, the above procedures are conducted to generate a map
comprising about 60,000 markers (an average of 3 biallelic markers per BAC if 20,000 BAC inserts are screened). In
some embodiments, the map includes one or more of the 653 markers obtained above (which include the sequences of
SEQ ID Nos. 1-50 and 51-100 or the sequences complementary thereto).

In a further preferred embodiment, the above procedures are conducted to generate a map
comprising about 80,000 markers (an average of 4 biallelic markers per BAC if 20,000 BAC inserts are screened). In
some embodiments, the map includes one or more of the 653 markers obtained above (which include the sequences of
SEQ ID Nos. 1-50 and 51-100 or the sequences complementary thereto).

In yet another embodiment, the above procedures are conducted to generate a map comprising about 100,000
markers (an average of 5 biallelic markers per BAC if 20,000 BAC inserts are screened). In some embodiments, the map
includes one or more of the 653 markers obtained above (which include the sequences of SEQ ID Nos. 1-50 and 51-100
or the sequences complementary thereto).

In a further embodiment, the above procedures are conducted to generate a map comprising about 120,000
markers (an average of 6 biallelic markers per BAC if 20,000 BAC inserts are screened). In some embodiments, the map
includes one or more of the 653 markers obtained above (which include the sequences of SEQ ID Nos. 1-50 and 51-100
or the sequences complementary thereto).

Alternatively, maps having the above-specified average numbers of biallelic markers per BAC which comprise
smaller portions of the genome, such as a set of chromosomes, a single chromosome, a particular subchromosomal
region, or any other desired portion of the genome, may also be constructed using the procedures provided herein.

In some embodiments, the biallelic markers in the map are separated from one another by an average distance
of 10-200kb. In further embodiments, the biallelic markers in the map are separated from one another by an average
distance of 15-150kb. In yet another embodiment, the biallelic markers in the map are separated from one another by an
average distance of 20-100kb. In other embodiments, the biallelic markers in the map are separated from one another
by an average distance of 100-150kb. In further embodiments, the biallelic markers in the map are separated from one
another by an average distance of 50-100kb. In yet another embodiment, the biallelic markers in the map are separated
from one another by an average distance of 25-50kb. Maps having the above-specified intermarker distances which
comprise smaller portions of the genome, such as a set of chromosomes, a single chromosome, a particular
subchromosomal region, or any other desired portion of the genome, may also be constructed using the procedures
provided herein.

Figure 2, showing the results of computer simulations of the distribution of inter-marker spacing on a randomly
distributed set of biallelic markers, indicates the percentage of biallelic markers which will be spaced a given distance
apart for a given number of markers/BAC in the genomic map (assuming 20,000 BACs constituting a minimally
overlapping array covering the entire genome are evaluated). One hundred iterations were performed for each
simulation (20,000 marker map, 40,000 marker map, 60,000 marker map, 120,000 marker map).

As illustrated in Figure 2a, 98% of inter-marker distances will be lower than 150kb provided 60,000 evenly
distributed markers are generated (3 per BAC); 90% of inter-marker distances will be lower than 150kb provided 40,000
evenly distributed markers are generated (2 per BAC); and 50% of inter-marker distances will be lower than 150kb
provided 20,000 evenly distributed markers are generated (1 per BAC).

As illustrated in Figure 2b, 98% of inter-marker distances will be lower than 80kb provided 120,000 evenly
distributed markers are generated (6 per BAC); 80% of inter-marker distances will be lower than 80kb provided 60,000
evenly distributed markers are generated (3 per BAC); and 15% of inter-marker distances will be lower than 80kb
provided 20,000 evenly distributed markers are generated (1 per BAC).
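
A simulation of this kind can be sketched in a few lines of Python. The sketch below is not the simulation used to produce Figure 2; it rests on assumptions that are not stated in this passage: BAC inserts of 150 kb laid end to end without overlap, and a fixed number of markers placed uniformly at random within each BAC. The function name, the default of 2,000 BACs (chosen for speed) and the random seed are likewise illustrative.

```python
import random


def fraction_of_gaps_below(markers_per_bac, threshold_kb, n_bacs=2000,
                           bac_kb=150.0, seed=0):
    """Fraction of inter-marker distances shorter than threshold_kb.

    Assumed model: n_bacs contiguous, non-overlapping BACs of bac_kb each,
    with markers_per_bac markers drawn uniformly within every BAC.
    """
    rng = random.Random(seed)
    positions = []
    for index in range(n_bacs):
        start = index * bac_kb
        positions.extend(
            start + rng.uniform(0.0, bac_kb) for _ in range(markers_per_bac)
        )
    positions.sort()
    # Distances between consecutive markers along the tiling path.
    gaps = [right - left for left, right in zip(positions, positions[1:])]
    return sum(gap < threshold_kb for gap in gaps) / len(gaps)


one_per_bac_150 = fraction_of_gaps_below(1, 150.0)
three_per_bac_150 = fraction_of_gaps_below(3, 150.0)
six_per_bac_80 = fraction_of_gaps_below(6, 80.0)
```

Under this model, with one marker per BAC the distance between neighbouring markers is symmetric about 150 kb, so about half of the gaps fall below that threshold; denser maps push the fraction toward one. Different BAC sizes or overlapping clones would shift the numbers.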

As already mentioned, high density biallelic marker maps allow association studies to be performed to identify
genes involved in complex traits.

Association studies examine the frequency of marker alleles in unrelated trait positive (T+) individuals
compared with trait negative (T-) controls, and are generally employed in the detection of polygenic inheritance.

Association studies as a method of mapping genetic traits rely on the phenomenon of linkage disequilibrium, 
which is described below. 

Linkage Disequilibrium

If two genetic loci lie on the same chromosome, then sets of alleles on the same chromosomal segment (called 
haplotypes) tend to be transmitted as a block from generation to generation. When not broken up by recombination, 
haplotypes can be tracked not only through pedigrees but also through populations. The resulting phenomenon at the 
population level is that the occurrence of pairs of specific alleles at different loci on the same chromosome is not 
random, and the deviation from random is called linkage disequilibrium (LD). 

If a specific allele in a given gene is directly involved in causing a particular trait T, its frequency will be 
statistically increased in a T+ population when compared to the frequency in a T- population. As a consequence of the 
existence of LD, the frequency of all other alleles present in the haplotype carrying the trait-causing allele (TCA) will also 
be increased in T+ individuals compared to T- individuals. Therefore, association between the trait and any allele in 
linkage disequilibrium with the trait-causing allele will suffice to suggest the presence of a trait-related gene in that 
particular allele's region. Linkage disequilibrium allows the relative frequencies in T+ and T- populations of a limited 
number of genetic polymorphisms (specifically biallelic markers) to be analyzed as an alternative to screening all possible
functional polymorphisms in order to find trait-causing alleles. 

The present invention then also concerns biallelic markers in linkage disequilibrium with the specific biallelic
markers described above and which are expected to present similar characteristics in terms of their respective
association with a given trait. In a preferred embodiment, the present invention concerns the biallelic markers that are in
linkage disequilibrium with the 653 biallelic markers obtained above (which include the sequences of SEQ ID Nos. 1-50
and 51-100 or the sequences complementary thereto).

LD among a set of biallelic markers having an adequate heterozygosity rate can be determined by genotyping
between 50 and 1000 unrelated individuals, preferably between 75 and 200, more preferably around 100. Genotyping a
biallelic marker consists of determining the specific allele carried by an individual at the given polymorphic base of the
biallelic marker. Genotyping can be performed using methods similar to those described above for the generation of the
biallelic markers, or using other genotyping methods such as those further described below.

LD between any pair of biallelic markers comprising at least one of the biallelic markers of the present
invention (Mi, Mj) can be calculated for every allele combination (Mi1,Mj1; Mi1,Mj2; Mi2,Mj1 and Mi2,Mj2), according to the
Piazza formula:

$$\Delta_{M_{ik},M_{jl}} = \sqrt{\theta_4} - \sqrt{(\theta_4 + \theta_3)(\theta_4 + \theta_2)}$$

where:

$\theta_4$ = - - = frequency of genotypes not having allele k at Mi and not having allele l at Mj
$\theta_3$ = - + = frequency of genotypes not having allele k at Mi and having allele l at Mj
$\theta_2$ = + - = frequency of genotypes having allele k at Mi and not having allele l at Mj
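
The Piazza statistic can be computed directly from presence/absence calls for the two alleles. The following Python sketch assumes the formula reads delta = sqrt(theta4) - sqrt((theta4 + theta3)(theta4 + theta2)) with the frequencies defined as above; the function name and the input layout (one pair of booleans per individual) are illustrative choices, not part of the text.

```python
from math import sqrt


def piazza_delta(genotypes):
    """Piazza linkage disequilibrium statistic for allele k at Mi and allele l at Mj.

    genotypes is a sequence of (has_k_at_Mi, has_l_at_Mj) boolean pairs,
    one per genotyped individual.
    """
    total = len(genotypes)
    if total == 0:
        raise ValueError("at least one genotype is required")
    neither = sum(1 for has_k, has_l in genotypes if not has_k and not has_l)
    only_l = sum(1 for has_k, has_l in genotypes if not has_k and has_l)
    only_k = sum(1 for has_k, has_l in genotypes if has_k and not has_l)
    theta4 = neither / total   # - - genotypes
    theta3 = only_l / total    # - + genotypes
    theta2 = only_k / total    # + - genotypes
    return sqrt(theta4) - sqrt((theta4 + theta3) * (theta4 + theta2))
```

When the four presence/absence classes are equally frequent the statistic is zero, whereas a sample split evenly between carriers of both alleles and carriers of neither gives sqrt(0.5) - 0.5.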

Linkage disequilibrium (LD) between pairs of biallelic markers (Mi, Mj) can also be calculated for every allele
combination (Mi1,Mj1; Mi1,Mj2; Mi2,Mj1; Mi2,Mj2) according to the maximum likelihood estimate (MLE) for delta (the
composite linkage disequilibrium coefficient), as described by Weir (B.S. Weir, Genetic Data Analysis, (1996), Sinauer
Ass. Eds., the disclosure of which is incorporated herein by reference). This formula allows linkage disequilibrium
between alleles to be estimated when only genotype, and not haplotype, data are available. This LD composite test
makes no assumption for random mating in the sampled population, and thus seems to be more appropriate than other
LD tests for genotypic data.
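
For orientation, the composite coefficient is commonly written as delta = n_AB / n - 2 p_A p_B, where n_AB = 2 n(AABB) + n(AABb) + n(AaBB) + n(AaBb) / 2. This passage does not reproduce the formula, so the expression and the Python sketch below should be treated as an assumption to be checked against Weir (1996); the function name and input layout are illustrative.

```python
def composite_ld(counts):
    """Composite LD estimate from two-locus genotype counts.

    counts maps (copies_of_A, copies_of_B) to the number of individuals,
    where each copy number is 0, 1 or 2. No haplotype phase is needed.
    """
    n = sum(counts.values())
    if n == 0:
        raise ValueError("at least one genotype is required")
    # Allele frequencies of A and B from the marginal copy numbers.
    p_a = sum(a * c for (a, _), c in counts.items()) / (2 * n)
    p_b = sum(b * c for (_, b), c in counts.items()) / (2 * n)
    n_ab = (
        2 * counts.get((2, 2), 0)
        + counts.get((2, 1), 0)
        + counts.get((1, 2), 0)
        + 0.5 * counts.get((1, 1), 0)
    )
    return n_ab / n - 2 * p_a * p_b
```

A sample in which every individual is homozygous for both alleles gives zero (no variation, hence no disequilibrium), while a sample split evenly between double homozygotes for A and B and individuals carrying neither allele gives 0.5.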

The skilled person will readily appreciate that other LD calculation methods can be used without undue
experimentation.

Example 10 illustrates the measurement of LD between a publicly known biallelic marker, the "ApoE Site A",
located within the Alzheimer's related ApoE gene, and other biallelic markers randomly derived from the genomic region
containing the ApoE gene.

Example 10
Measurement of Linkage Disequilibrium

As originally reported by Strittmatter et al. and by Saunders et al. in 1993, the Apo E ε4 allele is strongly
associated with both late-onset familial and sporadic Alzheimer's disease (AD) (Saunders, A.M., Lancet 342: 710-711
(1993) and Strittmatter, W.J. et al., Proc. Natl. Acad. Sci. U.S.A. 90: 1977-1981 (1993), the disclosures of which are
incorporated herein by reference). The 3 major isoforms of human Apolipoprotein E (apoE2, -E3, and -E4), as identified by
isoelectric focusing, are coded for by 3 alleles (ε2, ε3, and ε4). The ε2, ε3, and ε4 isoforms differ in amino acid
sequence at 2 sites, residue 112 (called site A) and residue 158 (called site B). The ancestral isoform of the protein is
Apo E3, which at sites A/B contains cysteine/arginine, while ApoE2 and -E4 contain cysteine/cysteine and
arginine/arginine, respectively (Weisgraber, K.H. et al., J. Biol. Chem. 256: 9077-9083 (1981); Rall, S.C. et al., Proc.
Natl. Acad. Sci. U.S.A. 79: 4696-4700 (1982), the disclosures of which are incorporated herein by reference).

Apo E ε4 is currently considered as a major susceptibility risk factor for AD development in individuals of
different ethnic groups (especially in Caucasians and Japanese compared to Hispanics or African Americans), across all
ages between 40 and 90 years, and in both men and women, as reported recently in a study performed on 5930 AD
patients and 8607 controls (Farrer et al., JAMA 278:1349-1356 (1997), the disclosure of which is incorporated herein
by reference). More specifically, the frequency of a C base coding for arginine 112 at site A is significantly increased in
AD patients.

Although the mechanistic link between Apo E e 4 and neuronal degeneration characteristic of AD remains to be 
established, current hypotheses suggest that the Apo E genotype may influence neuronal vulnerability by increasing the 
deposition and/or aggregation of the amyloid beta peptide in the brain or by indirectly reducing energy availability to
neurons by promoting atherosclerosis. 

Using the methods of the present invention, biallelic markers that are in the vicinity of the Apo E site A were
generated and the association of one of their alleles with Alzheimer's disease was analyzed. An Apo E public marker
(stSG94) was used to screen a human genome BAC library as previously described. A BAC, which gave a unique FISH
hybridization signal on chromosomal region 19q13.2.3, the chromosomal region harboring the Apo E gene, was selected
for finding biallelic markers in linkage disequilibrium with the Apo E gene as follows.

This BAC contained an insert of 205 kb that was subcloned as previously described. Fifty BAC subclones were
randomly selected and sequenced. Twenty five subclone sequences were selected and used to design twenty five pairs
of PCR primers allowing 500 bp amplicons to be generated. These PCR primers were then used to amplify the
corresponding genomic sequences in a pool of DNA from 100 unrelated individuals (blood donors of French origin) as
already described.

Amplification products from pooled DNA were sequenced and analyzed for the presence of biallelic
polymorphisms, as already described. Five amplicons were shown to contain a polymorphic base in the pool of 100
unrelated individuals, and therefore these polymorphisms were selected as random biallelic markers in the vicinity of the
Apo E gene. The sequences of both alleles of these biallelic markers (99-344/439; 99-355/219; 99-359/308;
99-365/344; 99-366/274) correspond to SEQ ID Nos: 301-305 and 307-311 (see the accompanying Sequence Listing and
Table 10). Corresponding pairs of amplification primers for generating amplicons containing these biallelic markers can
be chosen from those listed as SEQ ID Nos: 313-317 and 319-323.

An additional pair of primers (SEQ ID Nos: 318 and 324) was designed that allows amplification of the
genomic fragment carrying the biallelic polymorphism corresponding to the ApoE marker (99-2452/54; C/T; the C allele
is designated SEQ ID NO: 306 in the accompanying Sequence Listing, while the T allele is designated SEQ ID NO: 312 in
the accompanying Sequence Listing; see also Table 10), publicly known as Apo E site A (Weisgraber et al. (1981),
supra; Rall et al. (1982), supra).

The five random biallelic markers plus the Apo E site A marker were physically ordered by PCR screening of the
corresponding amplicons using all available BACs originally selected from the genomic DNA libraries, as previously
described, using the public Apo E marker stSG94. The amplicon order derived from this BAC screening is as follows:

(99-344/99-366) - (99-365/99-2452) - 99-359 - 99-355,
where brackets indicate that the exact order of the respective amplicons could not be established.

Linkage disequilibrium among the six biallelic markers (five random markers plus the Apo E site A) was
determined by genotyping the same 100 unrelated individuals from whom the random biallelic markers were identified.

DNA samples and amplification products from genomic PCR were obtained under conditions similar to those
described above for the generation of biallelic markers, and subjected to automated microsequencing reactions using
fluorescent ddNTPs (specific fluorescence for each ddNTP) and the appropriate microsequencing primers having a 3' end
immediately upstream of the polymorphic base in the biallelic markers. The sequence of these microsequencing primers is
indicated within the corresponding sequence listings of SEQ ID Nos: 325-330. Once specifically extended at the 3' end
by a DNA polymerase using the complementary fluorescent dideoxynucleotide analog (thermal cycling), the
microsequencing primer was precipitated to remove the unincorporated fluorescent ddNTPs. The reaction products were
analyzed by electrophoresis on ABI 377 sequencing machines. Results were automatically analyzed by appropriate
software further described in Example 13.

Linkage disequilibrium (LD) between all pairs of biallelic markers (Mi, Mj) was calculated for every allele
combination (Mi1,Mj1; Mi1,Mj2; Mi2,Mj1; Mi2,Mj2) according to the maximum likelihood estimate (MLE) for delta (the
composite linkage disequilibrium coefficient). The results of the LD analysis between the Apo E Site A marker and the
five new biallelic markers (99-344/439; 99-355/219; 99-359/308; 99-365/344; 99-366/274) are summarized in Table
2 below:



Table 2

Markers                     d x 100    SEQ ID Nos of the      SEQ ID Nos of the
                                       biallelic markers      amplification primers

ApoE Site A 99-2452/54      -          306, 312               318, 324
99-344/439                  1          301, 307               313, 319
99-366/274                  1          305, 311               317, 323
99-365/344                  8          304, 310               316, 322
99-359/308                  2          303, 309               315, 321
99-355/219                  1          302, 308               314, 320




The above LD results indicate that among the five biallelic markers randomly selected in a region of about 200
kb containing the Apo E gene, marker 99-365/344T is in relatively strong linkage disequilibrium with the Apo E site A
allele (99-2452/54C).
Therefore, since the Apo E site A allele is associated with Alzheimer's disease, one can predict that the T allele
of marker 99-365/344 will probably be found associated with AD. In order to test this hypothesis, the biallelic markers
of SEQ ID Nos: 301-306 and 307-312 were used in association studies as described below.

225 Alzheimer's disease patients were recruited according to clinical inclusion criteria based on the MMSE
test. The 248 control cases included in this study were both ethnically- and age-matched to the affected cases. Both
affected and control individuals corresponded to unrelated cases. The identities of the polymorphic bases of each of the
biallelic markers were determined in each of these individuals using the methods described above. Techniques for
conducting association studies are further described below.

The results of this study are summarized in Table 3 below : 

Table 3

MARKER                        ASSOCIATION DATA

                              Difference in allele frequency          Corresponding p-value
                              between individuals with Alzheimer's
                              and control individuals

99-344/439                    3.3%                                    9.54 E-02
99-366/274                    1.6%                                    2.09 E-01
99-365/344                    17.7%                                   6.9 E-10
99-2452/54 (ApoE Site A)      23.8%                                   3.95 E-21
99-359/308                    0.4%
99-355/219                    2.5%                                    2.54 E-01



The frequency of the Apo E site A allele in both AD cases and controls was found in agreement with that
previously reported (ca. 10% in controls and ca. 34% in AD cases, leading to a 24% difference in allele frequency), thus
validating the Apo E e4 association in the populations used for this study.

Moreover, as predicted from the LD analysis (Table 2), a significant association of the T allele of marker
99-365/344 with AD cases (18% increase in the T allele frequency in AD cases compared to controls, p value for this
difference = 6.9E-10) was observed.

The above results indicate that any marker in LD with one given marker associated with a trait will be
associated with the trait. It will be appreciated that, though in this case the ApoE Site A marker is the trait-causing
allele (TCA) itself, the same conclusion could be drawn with any other non-TCA marker associated with the studied trait.

These results further indicate that conducting association studies with a set of biallelic markers randomly
generated within a candidate region at a sufficient density (here about one biallelic marker every 40kb on average)
allows the identification of at least one marker associated with the trait.

In addition, these results correlate with the physical order of the six biallelic markers contemplated within the
present example (see above): marker 99-365/344, which had been found to be the closest in terms of physical distance
to the ApoE Site A marker, also shows the strongest LD with the Apo E site A marker.

In order to further refine the relationship between physical distance and linkage disequilibrium between biallelic
markers, a ca. 400 kb fragment from a genomic region on chromosome 8 was fully sequenced.

LD within ca. 230 pairs of biallelic markers derived therefrom was measured in a random French population
and analyzed as a function of the known physical inter-marker spacing. This analysis confirmed that, on average, LD
between 2 biallelic markers correlates with the physical distance that separates them. It further indicated that LD
between 2 biallelic markers tends to decrease when their spacing increases. More particularly, LD between 2 biallelic
markers tends to decrease when their inter-marker distance is greater than 50kb, and is further decreased when the
inter-marker distance is greater than 75kb. It was further observed that when 2 biallelic markers were further than
150kb apart, most often no significant LD between them could be evidenced. It will be appreciated that the size and
history of the sample population used to measure LD between markers may influence the distance beyond which LD
tends not to be detectable.

Assuming that LD can be measured between markers spanning regions up to on average of 150kb long, biallelic
marker maps will allow genome-wide LD mapping, provided they have an average inter-marker distance lower than
150kb.

Genome-wide LD mapping aims at identifying, for any TCA being searched, at least one biallelic marker in LD
with said TCA. Preferably, in order to enhance the power of LD maps, in some embodiments, the biallelic markers therein
have average inter-marker distances of 150kb or less, 75kb or less, 50kb or less, 30kb or less, or 25kb or less to
accommodate the fact that, in some regions of the genome, the detection of LD requires lower inter-marker distances.

The present invention provides methods to generate biallelic marker maps with average inter-marker distances
of 150kb or less. In some embodiments, the mean distance between biallelic markers constituting the high density map
will be less than 75kb, preferably less than 50kb. Further preferred maps according to the present invention contain
markers that are less than 37.5kb apart. In highly preferred embodiments, the average inter-marker spacing for the
biallelic markers constituting very high density maps is less than 30kb, most preferably less than 25kb.

Genetic maps containing biallelic markers (including the 653 biallelic markers obtained above, which include the
sequences of SEQ ID Nos. 1-50 and 51-100 or the sequences complementary thereto) may be used to identify and
isolate genes associated with detectable traits. The use of the genetic maps of the present invention is described in
more detail below.

Use of the High Density Biallelic Marker Map to Identify
Genes Associated with a Detectable Trait

One embodiment of the present invention comprises methods for identifying and isolating genes associated
with a detectable trait using the biallelic marker maps of the present invention.

In the past, the identification of genes linked with detectable traits has relied on a statistical approach called
linkage analysis. Linkage analysis is based upon establishing a correlation between the transmission of genetic markers
and that of a specific trait throughout generations within a family. In this approach, all members of a series of affected
families are genotyped with a few hundred markers, typically microsatellite markers, which are distributed at an average
density of one every 10 Mb. By comparing genotypes in all family members, one can attribute sets of alleles to parental
haploid genomes (haplotyping or phase determination). The origin of recombined fragments is then determined in the
offspring of all families. Those that co-segregate with the trait are tracked. After pooling data from all families,
statistical methods are used to determine the likelihood that the marker and the trait are segregating independently in all
families. As a result of the statistical analysis, one or several regions having a high probability of harboring a gene linked
to the trait are selected as candidates for further analysis. The result of linkage analysis is considered as significant (i.e.
there is a high probability that the region contains a gene involved in a detectable trait) when the chance of independent
segregation of the marker and the trait is lower than 1 in 1000 (expressed as a LOD score > 3). Generally, the length
of the candidate region identified using linkage analysis is between 2 and 20Mb.

Once a candidate region is identified as described above, analysis of recombinant individuals using additional 
markers allows further delineation of the candidate linked region. 

Linkage analysis studies have generally relied on the use of a maximum of 5,000 microsatellite markers, thus
limiting the maximum theoretical attainable resolution of linkage analysis to ca. 600 kb on average.
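
The relationship between the number of markers in a map and its average inter-marker spacing can be checked with
simple arithmetic. The following sketch is given for illustration only; it assumes a genome of about 3,000 Mb (a
figure not stated in this passage) and evenly spread markers, and the names `GENOME_SIZE_KB`, `average_spacing_kb`
and `markers_needed` are hypothetical.

```python
import math

# Assumption: haploid human genome of about 3,000 Mb, expressed in kb.
GENOME_SIZE_KB = 3_000_000


def average_spacing_kb(n_markers):
    """Average inter-marker distance for n_markers spread over the genome."""
    return GENOME_SIZE_KB / n_markers


def markers_needed(max_spacing_kb):
    """Smallest number of markers giving an average spacing of at most max_spacing_kb."""
    return math.ceil(GENOME_SIZE_KB / max_spacing_kb)


# A few hundred markers, 5,000 microsatellites, and denser biallelic maps.
for n in (300, 5_000, 20_000, 60_000):
    print(n, "markers ->", average_spacing_kb(n), "kb average spacing")

# Map sizes implied by the spacings discussed for LD mapping.
for spacing in (150, 75, 50, 25):
    print(spacing, "kb spacing ->", markers_needed(spacing), "markers")
```

Under this assumption, 5,000 markers correspond to the ca. 600 kb resolution mentioned above, and a spacing of 150kb
corresponds to a map of about 20,000 markers.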

Linkage analysis has been successfully applied to map simple genetic traits that show clear Mendelian
inheritance patterns and which have a high penetrance (penetrance is the ratio between the number of trait positive
carriers of allele a and the total number of a carriers in the population). About 100 pathological trait-causing genes were
discovered using linkage analysis over the last 10 years. In most of these cases, the majority of affected individuals had
affected relatives and the detectable trait was rare in the general population (frequencies less than 0.1%). In about 10
cases, such as Alzheimer's Disease, breast cancer, and Type II diabetes, the detectable trait was more common but the
allele associated with the detectable trait was rare in the affected population. Thus, the alleles associated with these
traits were not responsible for the trait in all sporadic cases.

Linkage analysis suffers from a variety of drawbacks. First, linkage analysis is limited by its reliance on the
choice of a genetic model suitable for each studied trait. Furthermore, as already mentioned, the resolution attainable
using linkage analysis is limited, and complementary studies are required to refine the analysis of the typical 2Mb to
20Mb regions initially identified through linkage analysis.

In addition, linkage analysis approaches have proven difficult when applied to complex genetic traits, such as
those due to the combined action of multiple genes and/or environmental factors. In such cases, too large an effort and
cost are needed to recruit the adequate number of affected families required for applying linkage analysis to these
situations, as recently discussed by Risch, N. and Merikangas, K. (Science 273:1516-1517 (1996), the disclosure of
which is incorporated herein by reference).

Finally, linkage analysis cannot be applied to the study of traits for which no large informative families are
available. Typically, this will be the case in any attempt to identify trait-causing alleles involved in sporadic cases, such
as alleles associated with positive or negative responses to drug treatment.

The present genetic maps and biallelic markers (including the 653 biallelic markers obtained above, which
include the sequences of SEQ ID Nos. 1-50 and 51-100 or the sequences complementary thereto) may be used to
identify and isolate genes associated with detectable traits using association studies, an approach which does not
require the use of affected families and which permits the identification of genes associated with sporadic traits.

Association studies are described in more detail below. 

Association Studies 

As already mentioned, any gene responsible or partly responsible for a given trait will be in LD with some
flanking markers. To map such a gene, specific alleles of these flanking markers which are associated with the gene or
genes responsible for the trait are identified. Although the following discussion of techniques for finding the gene or
genes associated with a particular trait using linkage disequilibrium mapping refers to locating a single gene which is
responsible for the trait, it will be appreciated that the same techniques may also be used to identify genes which are
partially responsible for the trait.

Association studies may be conducted within the general population (as opposed to the linkage analysis
techniques discussed above which are limited to studies performed on related individuals in one or several affected
families).

Association between a biallelic marker A and a trait T may primarily occur as a result of three possible
relationships between the biallelic marker and the trait.

First, allele a of biallelic marker A may be directly responsible for trait T (e.g., Apo E e4 site A and Alzheimer's
disease). However, since the majority of the biallelic markers used in genetic mapping studies are selected randomly,
they mainly map outside of genes. Thus, the likelihood of allele a being a functional mutation directly related to trait T is
very low.

Second, an association between a biallelic marker A and a trait T may also occur when the biallelic marker is
very closely linked to the trait locus. In other words, an association occurs when allele a is in linkage disequilibrium with
the trait-causing allele. When the biallelic marker is in close proximity to a gene responsible for the trait, more extensive
genetic mapping will ultimately allow a gene to be discovered near the marker locus which carries mutations in people
with trait T (i.e. the gene responsible for the trait or one of the genes responsible for the trait). As will be further
exemplified below, using a group of biallelic markers which are in close proximity to the gene responsible for the trait, the
location of the causal gene can be deduced from the profile of the association curve between the biallelic markers and
the trait. The causal gene will usually be found in the vicinity of the marker showing the highest association with the
trait.

Finally, an association between a biallelic marker and a trait may occur when people with the trait and people
without the trait correspond to genetically different subsets of the population who, coincidentally, also differ in the
frequency of allele a (population stratification). This phenomenon may be avoided by using ethnically matched large
heterogeneous samples.

Association studies are particularly suited to the efficient identification of genes that present common
polymorphisms, and are involved in multifactorial traits whose frequency is relatively higher than that of diseases with
monofactorial inheritance.

Association studies mainly consist of four steps: recruitment of trait-positive (T+) and trait-negative (T-)
populations with well-defined phenotypes, identification of a candidate region suspected of harboring a trait causing
gene, identification of said gene among candidate genes in the region, and finally validation of mutation(s) responsible for
the trait in said trait causing gene.

In a first step, trait + and trait - phenotypes have to be well-defined. In order to perform efficient and
significant association studies such as those described herein, the trait under study should preferably follow a bimodal
distribution in the population under study, presenting two clear non-overlapping phenotypes, trait + and trait -.

Nevertheless, in the absence of such a bimodal distribution (as may in fact be the case for complex genetic
traits), any genetic trait may still be analyzed using the association method proposed herein by carefully selecting the
individuals to be included in the trait + and trait - phenotypic groups. The selection procedure involves selecting
individuals at opposite ends of the non-bimodal phenotype spectrum of the trait under study, so as to include in these
trait + and trait - populations individuals who clearly represent non-overlapping, preferably extreme phenotypes.
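
The selection of individuals at opposite ends of a non-bimodal phenotype spectrum can be sketched as follows. This
is an illustrative assumption about how such a selection might be coded, not a prescribed procedure; the function
name `select_extreme_groups`, the `fraction` parameter and the sample measurements are hypothetical.

```python
def select_extreme_groups(phenotypes, fraction=0.2):
    """Pick trait + and trait - groups from the two tails of a quantitative trait.

    phenotypes: dict mapping an individual identifier to a measured trait value.
    fraction: share of the studied population placed in each group (hypothetical
    parameter; each group must not overlap the other).
    Returns (trait_positive_ids, trait_negative_ids).
    """
    ranked = sorted(phenotypes.items(), key=lambda item: item[1])
    group_size = max(1, int(len(ranked) * fraction))
    if 2 * group_size > len(ranked):
        raise ValueError("groups would overlap; use a smaller fraction")
    trait_negative = [name for name, _ in ranked[:group_size]]
    trait_positive = [name for name, _ in ranked[-group_size:]]
    return trait_positive, trait_negative


# Invented measurements for ten individuals.
measurements = {"ind%d" % i: float(i) for i in range(10)}
positives, negatives = select_extreme_groups(measurements, fraction=0.2)
print("trait +:", positives)
print("trait -:", negatives)
```

Individuals in the middle of the distribution are simply left out of both groups, which keeps the two phenotypes
non-overlapping.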

The definition of the inclusion criteria for the trait + and trait - populations is an important aspect of the
present invention. The selection of those drastically different but relatively uniform phenotypes enables efficient
comparisons in association studies and the possible detection of marked differences at the genetic level, provided that
the sample sizes of the populations under study are significant enough.

Generally, trait + and trait - populations to be included in association studies such as those proposed in the
present invention consist of phenotypically homogeneous populations of individuals each representing 100% of the
corresponding phenotype if the trait distribution is bimodal. If the trait distribution is non-bimodal, trait + and trait -
populations consist of phenotypically uniform populations of individuals representing each between 1 and 98%,
preferably between 1 and 80%, more preferably between 1 and 50%, and more preferably between 1 and 30%, most
preferably between 1 and 20% of the total population under study, and selected among individuals exhibiting
non-overlapping phenotypes. In some embodiments, the T+ and T- groups consist of individuals exhibiting the extreme
phenotypes within the studied population. The clearer the difference between the two trait phenotypes, the greater the
probability of detecting an association with biallelic markers.

In preferred embodiments, a first group of between 50 and 300 trait + individuals, preferably about 100
individuals, are recruited according to their phenotypes. In each case, a similar number of trait negative individuals are
included in such studies who are preferably both ethnically- and age-matched to the trait positive cases. Both trait + and
trait - individuals should correspond to unrelated cases.

Figure 3 shows, for a series of hypothetical sample sizes, the p-value significance obtained in association 
studies performed using individual markers from the high-density biallelic map, according to various hypotheses regarding 
the difference of allelic frequencies between the T+ and T- samples. It indicates that in all cases, samples ranging from 
150 to 500 individuals are numerous enough to achieve statistical significance. It will be appreciated that bigger or 
smaller groups can be used to perform association studies according to the methods of the present invention. 

In a second step, a marker/trait association study is performed that compares the genotype frequency of each
biallelic marker in the above described T+ and T- populations by means of a chi square statistical test (one degree of
freedom). In addition to this single marker association analysis, a haplotype association analysis is performed to define
the frequency and the type of the ancestral carrier haplotype. Haplotype analysis, by combining the informativeness of a
set of biallelic markers, increases the power of the association analysis, allowing false positive and/or negative data that
may result from the single marker studies to be eliminated.
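
A single marker comparison of this kind can be sketched as a chi square test with one degree of freedom on a
two-by-two table of allele counts. The sketch below is illustrative only: it assumes allele counts (two per
individual) rather than genotype counts, the function name `allele_chi_square` is hypothetical, and the counts in
the usage lines are invented.

```python
import math


def allele_chi_square(case_a, case_b, control_a, control_b):
    """Chi square test (one degree of freedom) on a 2x2 table of allele counts.

    case_a / case_b: counts of the two alleles in the T+ group.
    control_a / control_b: counts of the two alleles in the T- group.
    Returns (chi_square, p_value).
    """
    n = case_a + case_b + control_a + control_b
    row_case, row_control = case_a + case_b, control_a + control_b
    col_a, col_b = case_a + control_a, case_b + control_b
    if 0 in (row_case, row_control, col_a, col_b):
        raise ValueError("every row and column of the table must be non-empty")
    chi_square = (n * (case_a * control_b - case_b * control_a) ** 2
                  / (row_case * row_control * col_a * col_b))
    # For one degree of freedom the upper tail probability is erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi_square / 2.0))
    return chi_square, p_value


# Invented counts: allele a seen on 60 of 100 T+ chromosomes and 40 of 100 T- chromosomes.
chi_square, p_value = allele_chi_square(60, 40, 40, 60)
print("chi square = %.2f, p-value = %.2e" % (chi_square, p_value))
```

The p-value obtained for each marker is what is plotted or tabulated in the single marker association analysis.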

Genotyping can be performed using the microsequencing procedure described in Example 13, or any other
genotyping procedure suitable for this intended purpose.

If a positive association with a trait is identified using an array of biallelic markers having a high enough
density, the causal gene will be physically located in the vicinity of the associated markers, since the markers showing
positive association with the trait are in linkage disequilibrium with the trait locus. Regions harboring a gene responsible
for a particular trait which are identified through association studies using high density sets of biallelic markers will, on
average, be 20-40 times shorter in length than those identified by linkage analysis.

Once a positive association is confirmed as described above, a third step consists of completely sequencing the
BAC inserts harboring the markers identified in the association analyses. These BACs are obtained through screening
human genomic libraries with the markers probes and/or primers, as described above. Once a candidate region has been
sequenced and analyzed, the functional sequences within the candidate region (e.g. exons, splice sites, promoters, and
other potential regulatory regions) are scanned for mutations which are responsible for the trait by comparing the
sequences of the functional regions in a selected number of T+ and T- individuals using appropriate software. Tools for
sequence analysis are further described in Example 14.

Finally, candidate mutations are then validated by screening a larger population of T+ and
T- individuals using genotyping techniques described below. Polymorphisms are confirmed as
candidate mutations when the validation population shows association results compatible with those
found between the mutation and the trait in the test population.

In practice, in order to define a region bearing a candidate gene, the trait + and trait - populations are
genotyped using an appropriate number of biallelic markers. The markers may include one or more of the 653 markers
obtained above (which include the sequences of SEQ ID Nos: 1-50 and 51-100 or the sequences complementary thereto).

The markers used to define a region bearing a candidate gene may be distributed at an average density of 1
marker per 10-200 kb. Preferably, the markers used to define a region bearing a candidate gene are distributed at an
average density of 1 marker every 15-150 kb. In further preferred embodiments, the markers used to define a region
bearing a candidate gene are distributed at an average density of 1 marker every 20-100kb. In yet another preferred
embodiment, the markers used to define a region bearing a candidate gene are distributed at an average density of 1
marker every 100 to 150kb. In a further highly preferred embodiment, the markers used to define a region bearing a
candidate gene are distributed at an average density of 1 marker every 50 to 100kb. In yet another embodiment, the
biallelic markers used to define a region bearing a candidate gene are distributed at an average density of 1 marker every
25-50 kilobases. As mentioned above, in order to enhance the power of linkage disequilibrium based maps, in a preferred
embodiment, the marker density of the map will be adapted to take the linkage disequilibrium distribution in the genomic
region of interest into account.

In some embodiments, the initial identification of a candidate genomic region harboring a gene associated with
a detectable phenotype may be conducted using a preliminary map containing a few thousand biallelic markers.
Thereafter, the genomic region harboring the gene responsible for the detectable trait may be better delineated using a
map containing a larger number of biallelic markers. Furthermore, the genomic region harboring the gene responsible for
the detectable trait may be further delineated using a high density map of biallelic markers. Finally, the gene associated
with the detectable trait may be identified and isolated using a very high density biallelic marker map.

Example 11 describes a hypothetical procedure for identifying a candidate region harboring a gene associated
with a detectable trait. It will be appreciated that although Example 11 compares the results of analyses using markers
derived from maps having 3,000, 20,000, and 60,000 markers, the number of markers contained in the map is not
restricted to these exemplary figures. Rather, Example 11 exemplifies the increasing refinement of the candidate region
with increasing marker density. As increasing numbers of markers are used in the analysis, points in the association
analysis become broad peaks. The gene associated with the detectable trait under investigation will lie within or near
the region under the peak.



The initial identification of a candidate genomic region harboring a gene associated with a detectable trait may
be conducted using a genome-wide map comprising about 20,000 biallelic markers. The candidate genomic region may
be further defined using a map having a higher marker density, such as a map comprising about 40,000 markers, about
60,000 markers, about 80,000 markers, about 100,000 markers, or about 120,000 markers.

The use of high density maps such as those described above allows the identification of genes which are truly
associated with detectable traits, since the coincidental associations will be randomly distributed along the genome
while the true associations will map within one or more discrete genomic regions. Accordingly, biallelic markers located
in the vicinity of a gene associated with a detectable trait will give rise to broad peaks in graphs plotting the frequencies
of the biallelic markers in T+ individuals versus T- individuals. In contrast, biallelic markers which are not in the vicinity
of the gene associated with the detectable trait will produce unique points in such a plot. By determining the
association of several markers within the region containing the gene associated with the detectable trait, the gene
associated with the detectable trait can be identified using an association curve which reflects the difference between
the allele frequencies within the T+ and T- populations for each studied marker. The gene associated with the
detectable trait will be found in the vicinity of the marker showing the highest association with the trait.

Figures 4, 5, and 6 illustrate the above principles. As illustrated in Figure 4, an association analysis conducted
with a map comprising about 3,000 biallelic markers yields a group of points. However, when an association analysis is
performed using a denser map which includes additional biallelic markers, the points become broad peaks indicative of
the location of a gene associated with a detectable trait. For example, the biallelic markers used in the initial association
analysis may be obtained from a map comprising about 20,000 biallelic markers, as illustrated in Figure 5. In some
embodiments, one or more of the 653 biallelic markers obtained above (which include the sequences of SEQ ID Nos. 1-50
and 51-100 or the sequences complementary thereto) are used in the association analysis.

In the hypothetical example of Figure 4, the association analysis with 3,000 markers suggests peaks near
markers 9 and 17.

Next, a second analysis is performed using additional markers in the vicinity of markers 9 and 17, as illustrated
in the hypothetical example of Figure 5, using a map of about 20,000 markers. This step again indicates an association
in the close vicinity of marker 17, since more markers in this region show an association with the trait. However, none
of the additional markers around marker 9 shows a significant association with the trait, which makes marker 9 a
potential false positive. In some embodiments, one or more of the 653 biallelic markers obtained above (which include
the sequences of SEQ ID Nos. 1-50 and 51-100 or the sequences complementary thereto) are used in the second
analysis. In order to further test the validity of these two suspected associations, a third analysis may be obtained with
a map comprising about 60,000 biallelic markers. In some embodiments, one or more of the 653 biallelic markers
obtained above are used in the third association analysis. In the hypothetical example of Figure 6, more markers lying
around marker 17 exhibit a high degree of association with the detectable trait. Conversely, no association is confirmed
in the vicinity of marker 9. The genomic region surrounding marker 17 can thus be considered a candidate region for the
hypothetical trait of this simulation.
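
The distinction drawn above between an isolated point (marker 9) and a peak supported by neighboring markers
(marker 17) can be sketched as a simple filter. The sketch is illustrative only: the function name
`clustered_associations`, the threshold, the window size, and all marker positions and p-values are hypothetical.

```python
def clustered_associations(markers, threshold, window_kb):
    """Return associated markers that have another associated marker nearby.

    markers: list of (name, position_kb, p_value) tuples.
    threshold: p-value below which a marker is deemed associated.
    window_kb: maximum distance to a neighboring associated marker.
    Markers are returned in order of position; isolated hits are dropped.
    """
    hits = sorted((pos, name) for name, pos, p in markers if p < threshold)
    clustered = []
    for i, (pos, name) in enumerate(hits):
        near_previous = i > 0 and pos - hits[i - 1][0] <= window_kb
        near_next = i + 1 < len(hits) and hits[i + 1][0] - pos <= window_kb
        if near_previous or near_next:
            clustered.append(name)
    return clustered


# Invented denser map: extra markers around marker 9 show nothing,
# while extra markers around marker 17 also show association.
dense_map = [
    ("m9", 900, 1e-4), ("m9a", 880, 0.4), ("m9b", 925, 0.7),
    ("m17", 1700, 1e-5), ("m17a", 1720, 1e-4), ("m17b", 1690, 1e-3),
]
print(clustered_associations(dense_map, threshold=0.01, window_kb=50))
```

With these invented figures only the markers around marker 17 are retained, which mirrors the reasoning applied to
Figures 5 and 6.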

The statistical power of LD mapping using a high density marker map is also reinforced by complementing the
single point association analysis described above with a multi-marker association analysis, called haplotype analysis.

When a chromosome carrying a disease allele is first introduced into a population as a result of either mutation
or migration, the mutant allele necessarily resides on a chromosome having a unique set of linked markers: the ancestral
haplotype. As already mentioned, a haplotype association analysis allows the frequency and the type of the ancestral
carrier haplotype to be defined.

A haplotype analysis is performed by estimating the frequencies of all possible haplotypes for a given set of
biallelic markers in the T+ and T- populations, and comparing these frequencies by means of a chi square statistical test
(one degree of freedom). Haplotype estimations are usually performed by applying the Expectation-Maximization (EM)
algorithm (Excoffier L and Slatkin M, Mol Biol Evol 12:921-927 (1995), the disclosure of which is incorporated herein
by reference), using the EM-HAPLO program (Hawley ME, Pakstis AJ & Kidd KK, Am. J. Phys. Anthropol. 18:104
(1994), the disclosure of which is incorporated herein by reference). The EM algorithm is used to estimate haplotype
frequencies in the case when only genotype data from unrelated individuals are available. The EM algorithm is a
generalized iterative maximum likelihood approach to estimation that is useful when data are ambiguous and/or
incomplete.
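
For illustration, the EM principle can be sketched for the simplest case of two biallelic markers, where only
individuals heterozygous at both markers have an ambiguous phase. This is a minimal sketch under that assumption and
is not the EM-HAPLO program; the function name `em_haplotype_frequencies`, the fixed iteration count and the sample
genotypes are hypothetical.

```python
from itertools import product


def em_haplotype_frequencies(genotypes, n_iter=100):
    """Estimate two-marker haplotype frequencies from unphased genotypes by EM.

    genotypes: list of (g1, g2) tuples giving, for each unrelated individual,
    the number of copies (0, 1 or 2) of allele 1 at each of the two markers.
    Returns a dict mapping haplotypes such as (1, 0) to estimated frequencies.
    """
    haplotypes = list(product((0, 1), repeat=2))
    freq = {h: 0.25 for h in haplotypes}  # uninformative starting point
    n_chromosomes = 2 * len(genotypes)
    for _ in range(n_iter):
        counts = {h: 0.0 for h in haplotypes}
        for g1, g2 in genotypes:
            if g1 == 1 and g2 == 1:
                # E-step: share the double heterozygote between its two
                # possible phases according to the current frequencies.
                cis = freq[(1, 1)] * freq[(0, 0)]
                trans = freq[(1, 0)] * freq[(0, 1)]
                total = cis + trans
                w = cis / total if total > 0 else 0.5
                counts[(1, 1)] += w
                counts[(0, 0)] += w
                counts[(1, 0)] += 1 - w
                counts[(0, 1)] += 1 - w
            else:
                # Phase is unambiguous: read the two haplotypes directly.
                first = (1 if g1 >= 1 else 0, 1 if g2 >= 1 else 0)
                second = (1 if g1 == 2 else 0, 1 if g2 == 2 else 0)
                counts[first] += 1
                counts[second] += 1
        # M-step: new frequencies are the expected haplotype proportions.
        freq = {h: c / n_chromosomes for h, c in counts.items()}
    return freq


# Invented genotypes for 100 unrelated individuals.
sample = [(2, 2)] * 25 + [(1, 1)] * 50 + [(0, 0)] * 25
for haplotype, value in sorted(em_haplotype_frequencies(sample).items()):
    print(haplotype, round(value, 4))
```

The estimated frequencies obtained separately in the T+ and T- populations are the quantities that are then
compared by the chi square test.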

To improve the statistical power of the individual marker association analyses conducted as described above
using maps of increasing marker densities, haplotype studies can be performed using groups of markers located in
proximity to one another within regions of the genome. For example, using the methods described above in which the
association of an individual marker with a detectable phenotype was analyzed using maps of 3,000 markers, 20,000
markers, and 60,000 markers, a series of haplotype studies can be performed using groups of contiguous markers from
such maps or from maps having higher marker densities.

In a preferred embodiment, a series of successive haplotype studies including groups of markers spanning
regions of more than 1 Mb may be performed. In some embodiments, the biallelic markers included in each of these
groups may be located within a genomic region spanning less than 1kb, from 1 to 5kb, from 5 to 10kb, from 10 to 25kb,
from 25 to 50kb, from 50 to 150kb, from 150 to 250kb, from 250 to 500kb, from 500kb to 1Mb, or more than 1Mb.
Preferably, the genomic regions containing the groups of biallelic markers used in the successive haplotype analyses are
overlapping. It will be appreciated that the groups of biallelic markers need not completely cover the genomic regions of the
above-specified lengths but may instead be obtained from incomplete contigs having one or more gaps therein. As discussed
in further detail below, biallelic markers may be used in single point and haplotype association analyses regardless of the
completeness of the corresponding physical contig harboring them.

Without wishing to be limited to any particular numerical value, it is believed that those haplotypes displaying a
coefficient of relative risk above 1, preferably about 5 or more, preferably of about 7 or more are indicative of a
"significant risk" for the individuals carrying the identified haplotype to develop the given trait. However, it is difficult to
evaluate accurately quantified boundaries for the so-called "significant risk". Indeed, and as it has been demonstrated
previously, several traits observed in a given population are multifactorial in that they are not only the result of a single
genetic predisposition but also of other factors such as environmental factors. Thus, the evaluation of a significant risk
must take these parameters into consideration in order to, in a certain manner, weigh the potential importance of
external parameters in the development of a given trait. Thus, the relative risk which constitutes a "significant risk" to
develop a given trait is evaluated differently depending on the trait under consideration and the populations tested.
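
In a case/control design the relative risk is commonly approximated by an odds ratio, which is close to the
relative risk when the trait is rare. The sketch below illustrates that approximation only; how the coefficient of
relative risk is actually computed is not specified in this passage, and the function name `odds_ratio` and the
counts are hypothetical.

```python
def odds_ratio(case_carriers, case_noncarriers, control_carriers, control_noncarriers):
    """Odds ratio for carrying a haplotype, from case/control carrier counts.

    Assumption: used here as an approximation of the relative risk, which
    holds reasonably well when the trait is rare in the population.
    """
    if case_noncarriers == 0 or control_carriers == 0:
        raise ValueError("odds ratio is undefined when a denominator count is zero")
    return (case_carriers * control_noncarriers) / (case_noncarriers * control_carriers)


# Invented counts: 30 of 100 T+ individuals and 10 of 100 T- individuals carry the haplotype.
print("approximate relative risk: %.2f" % odds_ratio(30, 70, 10, 90))
```

A value of 1 corresponds to no difference in risk between carriers and non-carriers.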

Genome wide mapping using association studies with dense enough arrays of markers permits a case-by-case
best estimate of p-value significance thresholds. Given a test population comprising two ethnically matched trait
positive and trait negative groups of about 50 to about 500 individuals or more, conducting the above described
association studies will allow a p-value "cut-off" to be established by, for example, analyzing significant numbers of
allele frequency differences or, in some cases where appropriate, running computer simulations or control studies as
described in Examples 11, 20, and 31.

For a p-value above the threshold, a corresponding association between the trait and a studied marker will be
deemed not significant, while for a p-value below such a threshold, said association will be deemed significant. If the
p-value is significant, the genomic region around the marker will be further scrutinized for a trait-causing gene.

It is preferred that p-value significance thresholds be assessed for each case/control population comparison.
Both the genetic distance between sampled populations (stratification) and the dispersion due to random selection of
samples may indeed influence the p-value significance thresholds.

It will be appreciated that the above approaches may be conducted on any scale (i.e. over the whole genome, a
set of chromosomes, a single chromosome, a particular subchromosomal region, or any other desired portion of the
genome). As mentioned above, once significance thresholds have been assessed, population sample sizes may be
adapted as exemplified in Figure 3.

Example 12 below illustrates the increase in statistical power brought to an association study by a haplotype
analysis.



As shown in Table 3 within Example 10, at an average map density of one marker per 40 kb only one marker
(99-365/344) out of five random biallelic markers from a ca. 200 kb genomic region around the Apo E gene showed a
clear association to AD (delta allelic frequency in cases and controls = 18%; p value = 6.9 E-10). The allelic frequencies
of the other four random markers were not significantly different between AD cases and controls (p-values ≥ E-01).
However, since linkage disequilibrium can usually be detected between markers located further apart than an average 40
kb as previously discussed, one should expect that performing an association study with a local excerpt of a biallelic
marker map covering ca. 200 kb with an average inter-marker distance of ca. 40 kb should allow the identification of
more than one biallelic marker associated with AD.

A haplotype analysis was thus performed using the biallelic markers 99-344/439; 99-355/219; 99-359/308;
99-365/344; and 99-366/274 (of SEQ ID Nos: 301-305 and 307-311).

In a first step, marker 99-365/344 that was already found associated with AD was not included in the
haplotype study. Only biallelic markers 99-344/439; 99-355/219; 99-359/308; and 99-366/274, which did not show
any significant association with AD when taken individually, were used. This first haplotype analysis measured
frequencies of all possible two-, three-, or four-marker haplotypes in the AD case and control populations. As shown in
Figure 7, there was one haplotype among all the potential different haplotypes based on the four individually
non-significant markers ("haplotype 8", TAGG, comprising SEQ ID No. 305 which is the T allele of marker 99-366/274, SEQ
ID No. 301 which is the A allele of marker 99-344/439, SEQ ID No. 303 which is the G allele of marker 99-359/308 and
SEQ ID No. 302 which is the G allele of marker 99-355/219) that was present at statistically significant different
frequencies in the AD case and control populations (Δ = 12%; p value = 2.05 E-06). Moreover, a significant difference
was already observed for a three-marker haplotype included in the above mentioned "haplotype 8" ("haplotype 7", TGG,
Δ = 10%; p value = 4.76 E-05). Haplotype 7 comprises SEQ ID No. 305 which is the T allele of marker 99-366/274,
SEQ ID No. 303 which is the G allele of marker 99-359/308 and SEQ ID No. 302 which is the G allele of marker
99-355/219. The haplotype association analysis thus clearly increased the statistical power of the individual marker
association studies by more than four orders of magnitude when compared to single-marker analysis (from p values ≥
E-01 for the individual markers, see Table 3, to p value ≤ 2 E-06 for the four-marker "haplotype 8").

The significance of the values obtained for this haplotype association analysis was evaluated by the following
computer simulation. The genotype data from the AD cases and the unaffected controls were pooled and randomly
allocated to two groups which contained the same number of individuals as the case/control groups used to produce the
data summarized in Figure 7. A four-marker haplotype analysis (99-344/439; 99-355/219; 99-359/308; and
99-366/274) was run on these artificial groups. This experiment was reiterated 100 times and the results are shown in
Figure 8. No haplotype among those generated was found for which the p-value of the frequency difference between
both populations was more significant than 1 E-05. In addition, only 4% of the generated haplotypes showed p-values
lower than 1 E-04. Since both these p-value thresholds are less significant than the 2 E-06 p-value shown by
"haplotype 8", this haplotype can be considered significantly associated with AD.

In a second step, marker 99-365/344 was included in the haplotype analyses. The frequency differences
between the affected and non-affected populations were calculated for all two-, three-, four- or five-marker haplotypes
involving markers 99-344/439; 99-355/219; 99-359/308; 99-366/274; and 99-365/344. The most significant
p-values obtained in each category of haplotype (involving two, three, four or five markers) were examined depending on
which markers were involved or not within the haplotype. This showed that all haplotypes which included marker
99-365/344 showed a significant association with AD (p-values in the range of E-04 to E-11).

An additional way of evaluating the significance of the values obtained in the haplotype association analysis
was to perform a similar AD case-control study on biallelic markers generated from BACs containing inserts
corresponding to genomic regions derived from chromosomes 13 or 21 and not known to be involved in Alzheimer's
disease. Performing similar haplotype and individual association analyses as those described above and in Example 10
did not generate any significant association results (all p-values for haplotype analyses were less significant than E-03;
all p-values for single marker association studies were less significant than E-02).

The results described in Examples 10 and 12, generated from individual and haplotype studies using a biallelic
marker set of an average density equal to ca. 40 kb in the region of an Alzheimer's disease trait causing gene, indicate
that all biallelic markers of sufficient informative content located within a ca. 200 kb genomic region around a TCG can
potentially be successfully used to localize a trait causing gene with the methods provided by the present invention. This
conclusion is further supported by the results obtained through measuring the linkage disequilibrium between markers
99-365/344 or 99-359/308 and ApoE 4 Site A marker within Alzheimer's patients: as one could predict since LD is the
supporting basis for association studies, LD between these pairs of markers was enhanced in the diseased population vs.
the control population, in a similar way as the haplotype analysis enhanced the significance of the corresponding
association studies.

Once a given polymorphic site has been found and characterized as a biallelic marker according to the methods 
of the present invention, several methods can be used in order to determine the specific allele carried by an individual at 
the given polymorphic base. 

In some embodiments, genotyping will be applied to one or more of the markers of SEQ ID Nos: 301-305 and
307-311 or the sequences complementary thereto. In additional embodiments, genotyping will be applied to the markers
of SEQ ID Nos. 306 and 312 as well as one or more of the markers of SEQ ID Nos. 301-305 and 307-311. In some
embodiments, genotyping will be applied to one or more of the 653 biallelic markers obtained above (which include the
sequences of SEQ ID Nos. 1-50 and 51-100 or the sequences complementary thereto). The present invention further
contemplates the genotyping of any biallelic marker within the provided maps, including those that are in linkage
disequilibrium with the 653 biallelic markers obtained above (which include the sequences of SEQ ID Nos. 1-50 and
51-100 or the sequences complementary thereto) or the markers of SEQ ID Nos. 301-312 or the sequences complementary
thereto.

Most genotyping methods require the previous amplification of a DNA region carrying the polymorphic site of
interest.

The identification of biallelic markers described previously allows the design of appropriate oligonucleotides,
which can be used as primers to amplify a DNA fragment containing the polymorphic site of interest and for the
detection of such polymorphisms.

In particularly preferred embodiments, pairs of primers of SEQ ID Nos: 313-318 and 319-324 may be used to
generate amplicons harboring the markers of SEQ ID Nos: 301-306/307-312 or the sequences complementary thereto. In
further embodiments, pairs of amplification primers may be used to generate amplicons harboring the 653 markers
obtained above (which include the sequences of SEQ ID Nos. 1-50 and 51-100 or the sequences complementary thereto).
In highly preferred embodiments, pairs of the amplification primers of SEQ ID Nos: 101-150 and 151-200 may be used
to generate amplicons harboring the markers of SEQ ID Nos: 1-50 and 51-100 or the sequences complementary thereto.

It will be appreciated that amplification primers may be designed having any length suitable for their intended
purpose, in particular any length allowing their hybridization with a region of the DNA fragment to be amplified.

It will be further appreciated that the hybridization site of said amplification primers may be located at any
distance from the polymorphic base to be genotyped, provided said amplification primers allow the proper amplification
of a DNA fragment carrying said polymorphic site. The amplification primers may be oligonucleotides of 10, 15, 20 or
more bases in length which enable the amplification of the polymorphic site in the markers. In some embodiments, the
amplification product produced using these primers may be at least 100 bases in length (i.e. on average 50 nucleotides
on each side of the polymorphic base). In other embodiments, the amplification product produced using these primers
may be at least 500 bases in length (i.e. on average 250 nucleotides on each side of the polymorphic base). In still
further embodiments, the amplification product produced using these primers may be at least 1000 bases in length (i.e.
on average 500 nucleotides on each side of the polymorphic base).

The amplification of polymorphic fragments can be carried out as described in Example 6 on DNA samples
extracted as described in Example 5.

As already mentioned, allele frequencies of biallelic markers tested in association studies (individual or
haplotype) may be determined using microsequencing procedures.

A first step in microsequencing procedures consists in designing microsequencing primers adapted to each
biallelic marker to be genotyped. Microsequencing primers hybridize upstream of the polymorphic base to be genotyped,
either with the coding or with the non-coding strand. Microsequencing primers may be oligonucleotides of 8, 10, 15, 20
or more bases in length. Preferably, the 3' end of the microsequencing primer is immediately upstream of the
polymorphic base of the biallelic marker being genotyped, such that upon extension of the primer, the polymorphic base
is the first base incorporated. Such microsequencing primers are included within the scope of the present invention.
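
The primer placement described above can be sketched as follows. This is a minimal illustration under stated assumptions: the function names, the 0-based indexing of the polymorphic base, the default primer length and the example sequence are hypothetical and are not part of the sequence listings. The sketch returns one primer for each strand, each ending at its 3' end immediately next to the polymorphic base.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence written 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]

def microsequencing_primers(sequence, snp_index, length=20):
    """Return (forward, reverse) primers, both written 5' to 3'.

    sequence:  one strand of the amplified fragment, written 5' to 3'.
    snp_index: 0-based position of the polymorphic base in `sequence`.
    The 3' end of each primer is immediately adjacent to the polymorphic base,
    so the first base incorporated on extension is the polymorphic base
    (forward primer) or its complement (reverse primer).
    """
    if snp_index < length or snp_index + length >= len(sequence):
        raise ValueError("not enough flanking sequence for the requested primer length")
    forward = sequence[snp_index - length:snp_index]
    reverse = reverse_complement(sequence[snp_index + 1:snp_index + 1 + length])
    return forward, reverse

# Hypothetical 16-base fragment with the polymorphic base at index 8.
print(microsequencing_primers("GATTACAGCTTGACCA", 8, length=4))
```

Either primer may be used, consistent with the statement that microsequencing primers may hybridize with the coding or the non-coding strand.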



WO 99/04038 




PCT/IB98/01193 



In preferred embodiments, the microsequencing primers are those indicated as features within the sequence
listings corresponding to markers of SEQ ID Nos: 325-330/331-336. In some embodiments, the 653 biallelic markers
obtained above (which include the sequences of SEQ ID Nos. 1-50 and 51-100 or the sequences complementary thereto)
are genotyped using appropriate microsequencing oligonucleotides such as those of SEQ ID Nos. 201-250 or 251-300.

It will be appreciated that the biallelic markers of the present invention may be genotyped using
microsequencing primers having any desirable length, and hybridizing to any of the strands of the marker to be tested,
provided their design is suitable for their intended purpose. In some embodiments, the amplification primers or
microsequencing primers may be labeled. For example, in some embodiments, the amplification primers or
microsequencing primers may be biotinylated.

Typical microsequencing procedures that can be used in the context of the present invention are described in
Example 13 below.

Example 13 

Genotyping of biallelic markers using microsequencing procedures
Several microsequencing protocols conducted in liquid phase are well known to those skilled in the art. A first
possible detection analysis allowing the allele characterization of the microsequencing reaction products relies on
detecting fluorescent ddNTP-extended microsequencing primers after gel electrophoresis. A first alternative to this
approach consists in performing a liquid phase microsequencing reaction, the analysis of which may be carried out in
solid phase.

For example, the microsequencing reaction may be performed using 5'-biotinylated oligonucleotide primers and
fluorescein-dideoxynucleotides. The biotinylated oligonucleotide is annealed to the target nucleic acid sequence
immediately adjacent to the polymorphic nucleotide position of interest. It is then specifically extended at its 3'-end
following a PCR cycle, wherein the labeled dideoxynucleotide analog complementary to the polymorphic base is
incorporated. The biotinylated primer is then captured on a microtiter plate coated with streptavidin. The analysis is
thus entirely carried out in a microtiter plate format. The incorporated ddNTP is detected by a fluorescein
antibody-alkaline phosphatase conjugate.

In practice this microsequencing analysis is performed as follows. 20 µl of the microsequencing reaction is
added to 80 µl of capture buffer (SSC 2X, 2.5% PEG 8000, 0.25 M Tris pH 7.5, 1.8% BSA, 0.05% Tween 20) and
incubated for 20 minutes on a microtiter plate coated with streptavidin (Boehringer). The plate is rinsed once with
washing buffer (0.1 M Tris pH 7.5, 0.1 M NaCl, 0.1% Tween 20). 100 µl of anti-fluorescein antibody conjugated with
alkaline phosphatase, diluted 1/5000 in washing buffer containing 1.8% BSA, is added to the microtiter plate. The
antibody is incubated on the microtiter plate for 20 minutes. After washing the microtiter plate four times, 100 µl of
4-methylumbelliferyl phosphate (Sigma) diluted to 0.4 mg/ml in 0.1 M diethanolamine pH 9.6, 10 mM MgCl2 are added. The
detection of the microsequencing reaction is carried out on a fluorimeter (Dynatech) after 20 minutes of incubation.

As another alternative, solid phase microsequencing reactions have been developed, for which either the
oligonucleotide microsequencing primers or the PCR-amplified products derived from the DNA fragment of interest are
immobilized. For example, immobilization can be carried out via an interaction between biotinylated DNA and
streptavidin-coated microtitration wells or avidin-coated polystyrene particles.

As a further alternative, the PCR reaction generating the amplicons to be genotyped can be performed directly
in solid phase conditions, following procedures such as those described in WO 96/13609, the disclosure of which is
incorporated herein by reference.

In such solid phase microsequencing reactions, incorporated ddNTPs can either be radiolabeled (see Syvanen,
Clin. Chim. Acta 226:225-236 (1994), the disclosure of which is incorporated herein by reference) or linked to
fluorescein (see Livak and Hainer, Hum. Mutat. 3:379-385 (1994), the disclosure of which is incorporated herein by
reference). The detection of radiolabeled ddNTPs can be achieved through scintillation-based techniques. The detection
of fluorescein-linked ddNTPs can be based on the binding of antifluorescein antibody conjugated with alkaline
phosphatase, followed by incubation with a chromogenic substrate (such as p-nitrophenyl phosphate).

Other possible reporter-detection couples for use in the above microsequencing procedures include:
ddNTP linked to dinitrophenyl (DNP) and anti-DNP alkaline phosphatase conjugate (see Harju et al., Clin.
Chem. 39(11 Pt 1):2282-2287 (1993), incorporated herein by reference);
biotinylated ddNTP and horseradish peroxidase-conjugated streptavidin with o-phenylenediamine as a substrate (see
WO 92/15712, incorporated herein by reference).

A diagnosis kit based on fluorescein-linked ddNTP with antifluorescein antibody conjugated with alkaline
phosphatase has been commercialized under the name PRONTO by GamidaGen Ltd.

As yet another alternative microsequencing procedure, Nyren et al. (Anal. Biochem. 208:171-175 (1993), the
disclosure of which is incorporated herein by reference) have described a solid-phase DNA sequencing procedure that
relies on the detection of DNA polymerase activity by an enzymatic luminometric inorganic pyrophosphate detection
assay (ELIDA). In this procedure, the PCR-amplified products are biotinylated and immobilized on beads. The
microsequencing primer is annealed and four aliquots of this mixture are separately incubated with DNA polymerase and
one of the four different ddNTPs. After the reaction, the resulting fragments are washed and used as substrates in a
primer extension reaction with all four dNTPs present. The progress of the DNA-directed polymerization reactions is
monitored with the ELIDA. Incorporation of a ddNTP in the first reaction prevents the formation of pyrophosphate during
the subsequent dNTP reaction. In contrast, no ddNTP incorporation in the first reaction gives extensive pyrophosphate
release during the dNTP reaction and this leads to generation of light throughout the ELIDA reactions. From the ELIDA
results, the identity of the first base after the primer is easily deduced.
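
The deduction step of the ELIDA procedure can be sketched as follows. This is only an illustration of the logic described above: the function name, the signal units and the `ratio` cut-off used to decide that one aliquot is clearly darker than the others are hypothetical assumptions and are not taken from Nyren et al.

```python
def deduce_first_base(light_signals, ratio=0.5):
    """Deduce which ddNTP was incorporated from four ELIDA light readings.

    light_signals maps each ddNTP ('A', 'C', 'G', 'T') to the light measured
    during the subsequent dNTP reaction of that aliquot. The aliquot in which
    the ddNTP was incorporated cannot be extended, releases little
    pyrophosphate and therefore gives the weakest light signal.
    Returns the incorporated base, or None when no aliquot stands out.
    """
    if set(light_signals) != set("ACGT"):
        raise ValueError("one reading is required for each of A, C, G and T")
    ranked = sorted(light_signals, key=light_signals.get)
    lowest, second_lowest = ranked[0], ranked[1]
    # Hypothetical criterion: the weakest signal must be clearly below the next one.
    if light_signals[lowest] > ratio * light_signals[second_lowest]:
        return None
    return lowest

# Hypothetical readings in arbitrary units.
print(deduce_first_base({"A": 100.0, "C": 5.0, "G": 98.0, "T": 102.0}))
```

A sample carrying both alleles would be expected to give two partially reduced signals; this sketch reports such a case as undetermined rather than attempting to resolve it.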
It will be appreciated that several parameters of the above-described microsequencing procedures may be
successfully modified by those skilled in the art without undue experimentation. In particular, high throughput
improvements to these procedures may be elaborated, following principles such as those described further below.

It will be further appreciated that any other genotyping procedure may be applied to the genotyping of biallelic
markers.

Once the candidate region has been delineated using the high density biallelic marker map, a sequence analysis
process will allow the detection of all genes located within said region, together with a potential functional
characterization of said genes. The identified functional features may allow preferred trait-causing candidates to be
chosen from among the identified genes. More biallelic markers may then be generated within said candidate genes, and
used to perform refined association studies that will support the identification of the trait causing gene. Sequence
analysis processes are described in Example 14 below.

Example 14: Sequence Analysis 
DNA sequences, such as BAC inserts, containing the region carrying the candidate gene associated with the
detectable trait are sequenced and their sequence is analyzed using automated software which eliminates repeat
sequences while retaining potential gene sequences. The potential gene sequences are compared to numerous databases
to identify potential exons using a set of scoring algorithms such as trained Hidden Markov Models, statistical analysis
models (including promoter prediction tools) and the GRAIL neural network. Preferred databases for use in this analysis,
the construction and use of which are further detailed in Example 22 below, include the following:

NetGene database: 

This proprietary database contains sequences of 5' cDNA tags, obtained from a number of tissues and cells.
Currently more than 50,000 different 5' clones representing more than 50,000 different genes are included in NetGene.
The sequences in the NetGene database correspond specifically to the 5' regions of transcripts (first exons) and
therefore allow mapping of the beginning of genes within raw genomic sequences.


NRPU (Non-Redundant Protein-Unique) database:

NRPU is a non-redundant merge of the publicly available NBRF/PIR, Genpept, and SwissProt databases.
Homologies found with NRPU allow the identification of regions potentially coding for already known proteins or related
to known proteins (translated exons).

NREST (Non-Redundant EST database):

NREST is a merge of the EST subsection of the publicly available GenBank database. Homologies found with
NREST allow the location of potentially transcribed regions (translated or non-translated exons).

NRN (Non-Redundant Nucleic acid database): 
NRN is a merge of GenBank, EMBL and their daily updates. 

Any sequence giving a positive hit with NRPU, NREST or an "excellent" score using GRAIL or/and other scoring
algorithms is considered a potential functional region, and is then considered a candidate for genomic analysis.

While this first screening allows the detection of the "strongest" exons, a semi-automatic scan is further
applied to the remaining sequences in the context of the sequence assembly. That is, the sequences neighboring a 5'
site or an exon are submitted to another round of bioinformatics analysis with modified parameters. In this way, new
exon candidates are generated for genomic analysis.
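
The first screening rule stated above (a positive hit with NRPU or NREST, or an "excellent" GRAIL score) can be sketched as a simple filter. The data structure, field names and score labels below are hypothetical and serve only to illustrate the rule; they do not describe the actual software used.

```python
from dataclasses import dataclass

@dataclass
class RegionEvidence:
    """Evidence collected for one genomic sequence (hypothetical structure)."""
    name: str
    nrpu_hit: bool = False       # homology found in the protein database
    nrest_hit: bool = False      # homology found in the EST database
    grail_score: str = "none"    # hypothetical labels: "none", "good", "excellent"

def is_potential_functional_region(region):
    """Apply the first screening rule to one sequence."""
    return region.nrpu_hit or region.nrest_hit or region.grail_score == "excellent"

def select_candidates(regions):
    """Return the names of the sequences retained as candidates for genomic analysis."""
    return [r.name for r in regions if is_potential_functional_region(r)]

regions = [
    RegionEvidence("seq1", nrpu_hit=True),
    RegionEvidence("seq2", grail_score="good"),
    RegionEvidence("seq3", grail_score="excellent"),
]
print(select_candidates(regions))
```

Sequences not retained by such a filter would correspond to the "remaining sequences" that are passed to the semi-automatic scan.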



Using the above procedures, genes associated with detectable traits may be identified. 

Examples 15-23 illustrate the application of the above methods using biallelic markers to identify a gene
associated with a complex disease, prostate cancer, within a ca. 450 kb candidate region. Additional details of the
identification of the gene associated with prostate cancer are provided in the U.S. Patent Application entitled "Prostate
Cancer Gene" (GENSET.018A, Serial No. 08/995,306), the disclosure of which is incorporated herein by reference.



Use of Biallelic Markers to Identify a Gene Associated with Prostate Cancer 
Substantial amounts of LOH data supported the hypothesis that genes associated with distinct cancer types
are located within a particular region of the human genome. More specifically, this region was likely to harbor a gene
associated with prostate cancer. Association studies were performed as described below in order to identify this
prostate cancer gene. A YAC contig containing the genomic region suspected of harboring a gene associated with
prostate cancer was constructed as described in Example 15 below.

Example 15

YAC Contig Construction in the Candidate Genomic Region
First, a YAC contig which contains the candidate genomic region was constructed as follows. The
CEPH-Genethon YAC map for the entire human genome (Chumakov et al. (1995), supra) was used for detailed contig building in
the genomic region containing genetic markers known to map in the candidate genomic region. Screening data available
for several publicly available genetic markers were used to select a set of CEPH YACs localized within the candidate
region. This set of YACs was tested by PCR with the above mentioned genetic markers as well as with other publicly
available markers supposedly located within the candidate region. As a result of these studies, a YAC STS contig map
was generated around genetic markers known to map in this genomic region. Two CEPH YACs were found to constitute
a minimal tiling path in this region, with an estimated size of ca. 2 Megabases.

During this mapping effort several publicly known STS markers were precisely located within the contig. 
Example 16 below describes the identification of sets of biallelic markers within the candidate genomic region. 

Example 16 
BAC contig construction and
Biallelic Markers isolation within the candidate chromosomal region
Next, a BAC contig covering the candidate genomic region was constructed as follows. BAC libraries were
obtained as described in Woo et al., Nucleic Acids Res. 22:4922-4931 (1994), the disclosure of which is incorporated
herein by reference. Briefly, the two whole human genome BamHI and HindIII libraries already described in Example 1
were constructed using the pBeloBAC11 vector (Kim et al. (1996), supra).

The BAC libraries were then screened with all of the above mentioned STSs, following the procedure described
in Example 2 above.

The ordered BACs selected by STS screening and verified by FISH were assembled into contigs and new
markers were generated by partial sequencing of insert ends from some of them. These markers were used to fill the
gaps in the contig of BAC clones covering the candidate chromosomal region having an estimated size of 2 megabases.

Figure 8 illustrates a minimal array of overlapping clones which was chosen for further studies, and the
positions of the publicly known STS markers along said contig.

Selected BAC clones from the contig were subcloned and sequenced, essentially following the procedures
described in Examples 3 and 4.

Biallelic markers lying along the contig were identified following the processes described in Examples 5 and 6.

Figure 9 shows the locations of the biallelic markers along the BAC contig. This first set of markers
corresponds to a medium density map of the candidate locus, with an inter-marker distance averaging 50 kb-150 kb.

A second set of biallelic markers was then generated as described above in order to provide a very high-density
map of the region identified using the first set of markers which can be used to conduct association studies, as
explained below. This very high density map has markers spaced on average every 2-50 kb.

The biallelic markers were then used in association studies. DNA samples were obtained from individuals
suffering from prostate cancer and unaffected individuals as described in Example 17.



Example 17

Collection of DNA Samples from Affected and Non-affected Individuals

Prostate cancer patients were recruited according to clinical inclusion criteria based on pathological or radical
prostatectomy records. Control cases included in this study were both ethnically- and age-matched to the affected
cases; they were checked for both the absence of all clinical and biological criteria defining the presence or the risk of
prostate cancer, and for the absence of related familial prostate cancer cases. Both affected and control individuals
were all unrelated.

The two following groups of independent individuals were used in the association studies. The first group,
comprising individuals suffering from prostate cancer, contained 185 individuals. Of these 185 cases of prostate
cancer, 47 cases were sporadic and 138 cases were familial. The control group contained 104 non-diseased individuals.

Haplotype analysis was conducted using additional diseased (total samples: 281) and control samples (total
samples: 130), from individuals recruited according to similar criteria.

DNA was extracted from peripheral venous blood of all individuals as described in Example 5.

The frequencies of the biallelic markers in each population were determined as described in Example 18.



Example 18 
Genotyping Affected and Control Individuals
Genotyping was performed using the following microsequencing procedure.
Amplification was performed on each DNA sample using primers designed as previously explained. The pairs of primers
were used to generate amplicons harboring the biallelic markers 99-123, 4-26, 4-14, 4-77, 99-217, 4-67, 99-213,
99-221, 99-135, 99-1482, 4-73, and 4-65 using the protocols described in Example 8 above.

Microsequencing primers were designed for each of the biallelic markers, as previously described.
After purification of the amplification products, the microsequencing reaction mixture was prepared by adding, in a 20 µl
final volume: 10 pmol microsequencing oligonucleotide, 1 U Thermosequenase (Amersham E79000G), 1.25 µl
Thermosequenase buffer (260 mM Tris HCl pH 9.5, 65 mM MgCl2), and the two appropriate fluorescent ddNTPs (Perkin
Elmer, Dye Terminator Set 401095) complementary to the nucleotides at the polymorphic site of each biallelic marker
tested, following the manufacturer's recommendations. After 4 minutes at 94°C, 20 PCR cycles of 15 sec at 55°C, 5
sec at 72°C, and 10 sec at 94°C were carried out in a Tetrad PTC-225 thermocycler (MJ Research). The
unincorporated dye terminators were then removed by ethanol precipitation. Samples were finally resuspended in
formamide-EDTA loading buffer and heated for 2 min at 95°C before being loaded on a polyacrylamide sequencing gel.
The data were collected by an ABI PRISM 377 DNA sequencer and processed using the GENESCAN software (Perkin
Elmer).

Following gel analysis, data were automatically processed with software that allows the determination of the
alleles of biallelic markers present in each amplified fragment.

The software evaluates such factors as whether the intensities of the signals resulting from the above
microsequencing procedures are weak, normal or saturated, or whether the signals are ambiguous. In addition, the
software identifies significant peaks (according to shape and height criteria). Among the significant peaks, peaks
corresponding to the targeted site are identified based on their position. When two significant peaks are detected for
the same position, each sample is categorized as homozygous or heterozygous based on the height ratio.
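
The final classification step, in which a sample is categorized from the height ratio of the two allele peaks, can be sketched as follows. The function name and both numerical cut-offs (`min_height` and `het_ratio`) are hypothetical values chosen for illustration; the actual criteria used by the software are not given in the text.

```python
def call_genotype(height_allele1, height_allele2, min_height=50.0, het_ratio=0.3):
    """Classify a sample from the peak heights of the two alleles at one site.

    min_height: hypothetical minimum height for a peak to count as a usable signal.
    het_ratio:  hypothetical minimum ratio (smaller peak / larger peak) above
                which both alleles are considered present.
    """
    high = max(height_allele1, height_allele2)
    low = min(height_allele1, height_allele2)
    if high < min_height:
        return "no call"  # signal too weak to interpret
    if low / high >= het_ratio:
        return "heterozygous"
    if height_allele1 > height_allele2:
        return "homozygous allele 1"
    return "homozygous allele 2"

# Hypothetical peak heights in arbitrary fluorescence units.
for heights in [(1000.0, 900.0), (1000.0, 20.0), (10.0, 1000.0), (10.0, 5.0)]:
    print(heights, call_genotype(*heights))
```

Saturated or ambiguous signals, which the text says the software also evaluates, are not modeled in this sketch.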

Association analyses were then performed using the biallelic markers as described below.

Example 19 
Association Analysis 

Association studies were run in two successive steps. In a first step, a rough localization of the candidate
gene was achieved by determining the frequencies of the biallelic markers of Figure 9 in the affected and unaffected
populations. The results of this rough localization are shown in Figure 10. This analysis indicated that a gene
responsible for prostate cancer was located near the biallelic marker designated 4-67.

In a second phase of the analysis, the position of the gene responsible for prostate cancer was further refined using the
very high density set of markers including the 99-123, 4-26, 4-14, 4-77, 99-217, 4-67, 99-213, 99-221, 99-135,
99-1482, 4-73, and 4-65 markers.

As shown in Figure 11, the second phase of the analysis confirmed that the gene responsible for prostate
cancer was near the biallelic marker designated 4-67, most probably within a ca. 150 kb region comprising the marker.

A haplotype analysis was also performed as described in Example 20.

Example 20
Haplotype analysis

The allelic frequencies of each of the alleles of biallelic markers 99-123, 4-26, 4-14, 4-77, 99-217, 4-67,
99-213, 99-221, and 99-135 were determined in the affected and unaffected populations. Table 4 lists the internal
identification numbers of the markers used in the haplotype analysis, the alleles of each marker, the most frequent allele
in both unaffected individuals and individuals suffering from prostate cancer, the least frequent allele in both unaffected
individuals and individuals suffering from prostate cancer, and the frequencies of the least frequent alleles in each
population.

Table 4

                                  Frequency of least frequent allele **
Markers     Polymorphic base *    Cases       Controls
99-123      C/T                   0.35        0.3
4-26        A/G                   0.39        0.45
4-14        C/T                   0.35        0.41
4-77        C/G                   0.33        0.24
99-217      C/T                   0.31        0.23
4-67        C/T                   0.26        0.16
99-213      T/C                   0.45        0.38
99-221      C/A                   0.43        0.43
99-135      A/G                   0.25        0.3

*  most frequent allele/least frequent allele
** standard deviations: 0.023 to 0.031 for controls; 0.018 to 0.021 for cases
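
The standard deviation ranges given with Table 4 are consistent with a simple binomial sampling error computed over 2N chromosomes. The text does not state the formula that was used, so the following is only a sketch under that assumption, and under the further assumption that the haplotype analysis sample sizes (281 cases and 130 controls) apply; the function name is hypothetical.

```python
import math

def allele_frequency_sd(frequency, n_individuals):
    """Binomial standard deviation of an allele frequency.

    Each diploid individual contributes two chromosomes, so the frequency is
    estimated from 2 * n_individuals observations (an assumption of this sketch).
    """
    return math.sqrt(frequency * (1.0 - frequency) / (2 * n_individuals))

# Lowest and highest least-frequent-allele frequencies listed in Table 4.
print(round(allele_frequency_sd(0.25, 281), 3), round(allele_frequency_sd(0.45, 281), 3))
print(round(allele_frequency_sd(0.16, 130), 3), round(allele_frequency_sd(0.45, 130), 3))
```

Under these assumptions the computed values fall at the ends of the ranges reported for cases and for controls, respectively.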



Among all the theoretical potential different haplotypes based on 2 to 9 markers, 1 1 haplotypes showing a 
strong association with prostate cancer were selected, Th8 results of these hsplotype analyzes ars shown in Figure 1 2* 

Figures 11 and 12 aggregate association analysis results with sequencing results, generated following the
procedures further described in Example 21, which permitted the physical order and/or the distance between markers to
be estimated.

The significance of the values obtained in Figure 12 is underscored by the following results of computer
simulations. For the computer simulations, the data from the affected individuals and the unaffected controls were
pooled and randomly allocated to two groups which contained the same number of individuals as the affected and
unaffected groups used to compile the data summarized in Figure 12. A haplotype analysis was run on these artificial
groups for the six markers included in haplotype 5 of Figure 12. This experiment was reiterated 100 times and the
results are shown in Figure 13. Among 100 iterations, only 5% of the obtained haplotypes are present with a p-value
less significant than E-04 as compared to the p-value of 9 E-07 for haplotype 5 of Figure 12. Furthermore, for haplotype
5 of Figure 12, only 6% of the obtained haplotypes have a significance level below 5 E-03, while none of them show a
significance level below 5 E-03.



Thus, using the data of Figure 13 and evaluating the associations for single marker alleles or for haplotypes
will permit estimation of the risk a corresponding carrier has to develop prostate cancer. It will be appreciated that
significance thresholds of relative risks will be more finely assessed according to the population tested.

Diagnostic techniques for determining an individual's risk of developing prostate cancer may be implemented as
described below for the markers in the maps of the present invention, including the 99-123, 4-26, 4-14, 4-77, 99-217,
4-67, 99-213, 99-221, and 99-135 markers.

The above haplotype analysis indicated that 171kb of genomic DNA between biallelic markers 4-14 and
99-221 totally or partially contains a gene responsible for prostate cancer. Therefore, the protein coding sequences lying
within this region were characterized to locate the gene associated with prostate cancer. This analysis, described in
further detail below, revealed a single protein coding sequence in the 171 kb genomic region, which was designated as
the PG1 gene.



Template DNA for sequencing the PG1 gene was obtained as follows. BACs E and F from Fig. 9 were subcloned
as previously described. Plasmid inserts were first amplified by PCR on PE 9600 thermocyclers (Perkin-Elmer), using
appropriate primers, AmpliTaq Gold (Perkin-Elmer), dNTPs (Boehringer), buffer and cycling conditions as recommended by the
Perkin-Elmer Corporation.

PCR products were then sequenced using automatic ABI Prism 377 sequencers (Perkin Elmer, Applied Biosystems
Division, Foster City, CA). Sequencing reactions were performed using PE 9600 thermocyclers (Perkin Elmer) with standard
dye-primer chemistry and ThermoSequenase (Amersham Life Science). The primers were labeled with the JOE, FAM, ROX
and TAMRA dyes. The dNTPs and ddNTPs used in the sequencing reactions were purchased from Boehringer. Sequencing
buffer, reagent concentrations and cycling conditions were as recommended by Amersham.

Following the sequencing reaction, the samples were precipitated with EtOH, resuspended in formamide loading
buffer, and loaded on a standard 4% acrylamide gel. Electrophoresis was performed for 2.5 hours at 3000V on an ABI 377
sequencer, and the sequence data were collected and analyzed using the ABI Prism DNA Sequencing Analysis Software,
version 2.1.2.

The sequence data obtained as described above were transferred to a proprietary database, where quality control
and validation steps were performed. A proprietary base-caller flagged suspect peaks, taking into account the shape of the
peaks, the inter-peak resolution, and the noise level. The proprietary base-caller also performed an automatic trimming. Any
stretch of 25 or fewer bases having more than 4 suspect peaks was considered unreliable and was discarded.
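
The trimming rule stated above can be illustrated with a short sketch. The proprietary base-caller itself is not described in this document, so the following Python code is only an assumed reading of the rule: the function names, the representation of suspect peaks as a list of booleans, and the choice to discard every base of any 25-base window containing more than 4 suspect peaks are all hypothetical.

```python
def unreliable_mask(suspect, window=25, max_suspect=4):
    """Flag bases lying in any window of `window` bases that holds
    more than `max_suspect` suspect peaks (assumed reading of the rule)."""
    n = len(suspect)
    mask = [False] * n
    if n == 0:
        return mask
    span = min(window, n)            # short reads are treated as one window
    count = sum(suspect[:span])      # suspect peaks in the first window
    for start in range(0, n - span + 1):
        if start > 0:
            # slide the window by one base: add the new base, drop the old one
            count += suspect[start + span - 1] - suspect[start - 1]
        if count > max_suspect:
            for i in range(start, start + span):
                mask[i] = True
    return mask


def keep_reliable(read, suspect):
    """Return the read with all bases flagged as unreliable removed."""
    mask = unreliable_mask(suspect)
    return "".join(base for base, bad in zip(read, mask) if not bad)
```

Marking the whole window rather than only the shortest stretch is the more conservative of the possible interpretations; a real pipeline might trim differently.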

The sequence fragments from BAC subclones isolated as described above were assembled using Gap4
software from R. Staden (Bonfield et al., 1995). This software allows the reconstruction of a single sequence from
sequence fragments. The sequence deduced from the alignment of different fragments is called the consensus
sequence. Directed sequencing techniques (primer walking) were used to complete sequences and link contigs.



Example 21

Identification of the Genomic Sequence in the Candidate Region

Potential functional sequences were then identified as described in Example 22.

Example 22 
Identification of Functional Sequences 
Potential exons in BAC-derived human genomic sequences were located by homology searches on protein, nucleic
acid and EST (Expressed Sequence Tags) public databases. Main public databases were locally reconstructed as mentioned
in Example 14. The protein database, NRPU (Non-redundant Protein Unique), is formed by a non-redundant fusion of the
Genpept (Benson et al., Nucleic Acids Res. 24:1-5 (1996), the disclosure of which is incorporated herein by reference),
Swissprot (Bairoch, A. and Apweiler, R., Nucleic Acids Res. 24:21-25 (1996), the disclosure of which is incorporated herein
by reference) and PIR/NBRF (George et al., Nucleic Acids Res. 24:17-20 (1996), the disclosure of which is incorporated
herein by reference) databases. Redundant data were eliminated by using the NRDB software (Benson et al. (1996), supra)
and internal repeats were masked with the XNU software (Benson et al., supra). Homologies found using the NRPU
database allowed the identification of sequences corresponding to potential coding exons related to known proteins.

The EST local database is composed of the gbest section (1-9) of GenBank (Benson et al. (1996), supra), and thus
contains all publicly available transcript fragments. Homologies found with this database allowed the localization of
potentially transcribed regions.

The local nucleic acid database contained all sections of GenBank and EMBL (Rodriguez-Tomé et al., Nucleic Acids
Res. 24:6-12 (1996), the disclosure of which is incorporated herein by reference) except the EST sections. Redundant data
were eliminated as previously described.

Similarity searches in protein or nucleic acid databases were performed using the BLAST software (Altschul et al.,
J. Mol. Biol. 215:403-410 (1990), the disclosure of which is incorporated herein by reference). Alignments were refined
using the Fasta software, and multiple alignments used Clustal W. Homology thresholds were adjusted for each analysis
based on the length and the complexity of the tested region, as well as on the size of the reference database.

Potential exon sequences identified as above were used as probes to screen cDNA libraries. Extremities of positive
clones were sequenced and the sequence stretches were positioned on the genomic sequence determined above. Primers
were then designed using the results from these alignments in order to enable the cloning of cDNAs derived from the gene
associated with prostate cancer that was identified using the above procedures.

The obtained cDNA molecules were then sequenced and results of Northern blot analysis of prostate mRNAs
supported the existence of a major cDNA having a 5-6kb length. The structure of the gene associated with prostate cancer
was evaluated as described in Example 23.

Example 23
Analysis of Gene Structure 

The intron/exon structure of the gene was finally completely deduced by aligning the mRNA sequence from the
cDNA obtained as described above and the genomic DNA sequence obtained as described above. This alignment
permitted the positions of the introns and exons, the positions of the start and end nucleotides
defining each of the at least 8 exons, the locations and phases of the 5' and 3' splice sites, the position of the stop
codon, and the position of the polyadenylation site to be determined in the genomic sequence. This analysis also yielded
the positions of the coding region in the mRNA, and the locations of the polyadenylation signal and polyA stretch in the
mRNA.

The gene identified as described above comprises at least 8 exons and spans more than 52kb. A G/C rich
putative promoter region was identified upstream of the coding sequence. A CCAAT in the putative promoter was also
identified. The promoter region was identified as described in Prestridge, D.S., Predicting Pol II Promoter Sequences
Using Transcription Factor Binding Sites, J. Mol. Biol. 249:923-932 (1995), the disclosure of which is incorporated
herein by reference.

Additional analysis using conventional techniques, such as a 5'RACE reaction using the Marathon-Ready
human prostate cDNA kit from Clontech (Catalog No. PT115IM), may be performed to confirm that the 5' end of the cDNA
obtained above is the authentic 5' end in the mRNA.

Alternatively, the 5' sequence of the transcript can be determined by conducting a PCR amplification with a
series of primers extending from the 5' end of the identified coding region.

The above methods were also used to identify biallelic markers in a gene which was an attractive candidate for
a gene associated with asthma. Examples 24-31 show how the use of methods of the present invention allowed this
gene to be identified as a gene responsible, at least partially, for asthma in the studied populations. Additional details of
the identification of the gene associated with asthma are provided in U.S. Provisional Application Serial No.
60/081,893 (Genset.026PR) and U.S. Provisional Patent Application Genset.026PR2, the disclosures of which are
incorporated herein by reference.

Example 24 

Detection of biallelic markers in the candidate gene: DNA extraction
Donors were unrelated and healthy. They presented a sufficient diversity for being representative of a French
heterogeneous population. The DNA from 100 individuals was extracted and tested for the detection of the biallelic
markers.

30 ml of peripheral venous blood were taken from each donor in the presence of EDTA. Cells (pellet) were
collected after centrifugation for 10 minutes at 2000 rpm. Red cells were lysed by a lysis solution (50 ml final volume:
10 mM Tris pH 7.6; 5 mM MgCl2; 10 mM NaCl). The solution was centrifuged (10 minutes, 2000 rpm) as many times as
necessary to eliminate the residual red cells present in the supernatant, after resuspension of the pellet in the lysis
solution.

The pellet of white cells was lysed overnight at 42°C with 3.7 ml of lysis solution composed of:
• 3 ml TE 10-2 (Tris-HCl 10 mM, EDTA 2 mM) / NaCl 0.4 M
• 200 µl SDS 10%
• 500 µl K-proteinase (2 mg K-proteinase in TE 10-2 / NaCl 0.4 M).

For the extraction of proteins, 1 ml saturated NaCl (6M) (1/3.5 v/v) was added. After vigorous agitation, the
solution was centrifuged for 20 minutes at 10000 rpm.

For the precipitation of DNA, 2 to 3 volumes of 100% ethanol were added to the previous supernatant and the solution
was centrifuged for 30 minutes at 2000 rpm. The DNA solution was rinsed three times with 70% ethanol to eliminate
salts, and centrifuged for 20 minutes at 2000 rpm. The pellet was dried at 37°C, and resuspended in 1 ml TE 10-1 or 1
ml water. The DNA concentration was evaluated by measuring the OD at 260 nm (1 unit OD = 50 µg/ml DNA).

To determine the presence of proteins in the DNA solution, the OD 260 / OD 280 ratio was determined. Only
DNA preparations having an OD 260 / OD 280 ratio between 1.8 and 2 were used in the subsequent examples described
below.
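
The two spectrophotometric checks described above amount to simple arithmetic, sketched here in Python. The function names and the optional dilution factor are assumptions added for illustration; only the conversion factor (1 OD unit at 260 nm = 50 µg/ml DNA) and the accepted ratio range of 1.8 to 2 come from the text.

```python
def dna_concentration_ug_per_ml(od260, dilution_factor=1.0):
    """Concentration from absorbance at 260 nm, using 1 OD unit = 50 ug/ml.
    `dilution_factor` is a hypothetical convenience for diluted readings."""
    return od260 * 50.0 * dilution_factor


def is_pure_enough(od260, od280, low=1.8, high=2.0):
    """True when the OD 260 / OD 280 ratio lies within the accepted range."""
    ratio = od260 / od280
    return low <= ratio <= high
```

For example, a reading of 0.5 OD on a 20-fold dilution corresponds to 500 µg/ml, and a preparation with OD 260 = 0.95 and OD 280 = 0.5 (ratio 1.9) would be retained.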

The pool was constituted by mixing equivalent quantities of DNA from each individual.

Example 25 

Detection of the biallelic markers: amplification of genomic DNA by PCR
The amplification of specific genomic sequences of the DNA samples of Example 24 was carried out on the
pool of DNA obtained previously. In addition, 50 individual samples were similarly amplified.



PCR assays were performed using the following protocol:

Final volume                                             25 µl
DNA                                                      2 ng/µl
MgCl2                                                    2 mM
dNTP (each)                                              200 µM
primer (each)                                            2.9 ng/µl
AmpliTaq Gold DNA polymerase                             0.05 unit/µl
PCR buffer (10x = 0.1 M Tris-HCl pH 8.3, 0.5 M KCl)      1x

Pairs of first primers were designed to amplify the promoter region, exons, and 3' end of the candidate
asthma-associated gene using the sequence information of the candidate gene and the OSP software (Hillier & Green, 1991).
These first primers were about 20 nucleotides in length and contained a common oligonucleotide tail upstream of the
specific bases targeted for amplification which was useful for sequencing. The synthesis of these primers was
performed following the phosphoramidite method, on a GENSET UFPS 24.1 synthesizer.

DNA amplification was performed on a Genius II thermocycler. After heating at 94°C for 10 min, 40 cycles
were performed. Each cycle comprised: 30 sec at 94°C, 55°C for 1 min, and 30 sec at 72°C. For final elongation, 7 min
at 72°C ended the amplification. The quantities of the amplification products obtained were determined on 96-well
microtiter plates, using a fluorometer and Picogreen as intercalant agent (Molecular Probes).

Example 26

Detection of the biallelic markers: sequencing of amplified genomic DNA and identification of polymorphisms
The sequencing of the amplified DNA obtained in Example 25 was carried out on ABI 377 sequencers. The
sequences of the amplification products were determined using automated dideoxy terminator sequencing reactions with
a dye terminator cycle sequencing protocol. The products of the sequencing reactions were run on sequencing gels and
the sequences were analyzed as formerly described.

The sequence data were further evaluated using the above mentioned polymorphism analysis software
designed to detect the presence of biallelic markers among the pooled amplified fragments. The polymorphism search
was based on the presence of superimposed peaks in the electrophoresis pattern resulting from different bases occurring
at the same position as described previously.

Six fragments of amplification were analyzed. In these segments, 8 biallelic markers were detected. The
localization of the biallelic markers, the polymorphic bases of each allele, and the frequencies of the most frequent
alleles are shown in Table 5.

Table 5

Amplicon   Marker Name   Origin of DNA   Localization in gene   Polymorphism   Frequency
1          204/326       Ind.            Promoter               A/G            96.2 (G)
2          32/357        Pool            Intron 1               A/C            67.7 (C)
3          33/175        Ind.            Exon 2                 C/T            97.3 (C)
3          33/234        Pool            Intron 2               A/C            56.7 (C)
3          33/327        Ind.            Intron 2               C/T            75.3 (T)
5          35/358        Pool            Intron 4               C/G            67.9 (G)
5          35/390        Ind.            Intron 4               C/T            82 (C)
6          36/164        Ind.            Exon 5                 A/G            99.5 (G)




Allelic frequencies were determined in a population of random blood donors of French Caucasian origin. Their wide
range is due to the fact that besides screening a pool of 100 individuals to generate biallelic markers as described
above, polymorphism searches were also conducted in an individual testing format for 50 samples. This strategy was
chosen here to provide a potential shortcut towards the identification of putative causal mutations in the association
studies using them. As the 36/164 biallelic marker was found in only one individual, this marker was not considered in
the association studies.

The fourth fragment of amplification carrying exon 3 (not shown in the Table) was not polymorphic in the
tested samples (1 pool + 50 individuals).
Example 27

Validation of the polymorphisms through microsequencing
The biallelic markers identified in Example 26 were further confirmed and their respective frequencies were
determined through microsequencing. Microsequencing was carried out for each individual DNA sample described in
Example 24.

Amplification from genomic DNA of individuals was performed by PCR as described above for the detection of
the biallelic markers with the same set of PCR primers described above.

The preferred primers used in microsequencing were about 19 nucleotides in length and hybridized just upstream
of the considered polymorphic base.
Five primers hybridized with the non-coding strand of the gene. For the biallelic markers 204/326, 35/358 and 36/164,
primers hybridized with the coding strand of the gene.

The microsequencing reaction was performed as described in Example 18.

Example 28 

Association study between asthma and the biallelic markers of the candidate gene: collection of DNA samples from
affected and non-affected individuals

The asthmatic population used to perform association studies in order to establish whether the candidate gene 
was an asthma-causing gene consisted of 298 individuals. More than 90 % of these 298 asthmatic individuals had a 
Caucasian ethnic background. 

The control population consisted of 373 unaffected individuals, of whom 279 were French (at least 70 % were
of Caucasian origin) and 94 were American (at least 90 % were of Caucasian origin).

DNA samples were obtained from asthmatic and non-asthmatic individuals as described above.

Example 29 

Association study between asthma and the biallelic markers of the candidate gene: genotyping of affected and control
individuals

The general strategy to perform the association studies was to individually scan the DNA samples from all
individuals in each of the populations described above in order to establish the allele frequencies of the above described
biallelic markers in each of these populations.

Allelic frequencies of the above-described biallelic markers in each population were determined by performing
microsequencing reactions on amplified fragments obtained by genomic PCR performed on the DNA samples from each
individual. Genomic PCR and microsequencing were performed as detailed above in Examples 25 and 27 using the
described amplification and microsequencing primers.

Example 30 

Association study between asthma and the biallelic markers of the candidate gene
Table 6 shows the results of the association study between five biallelic markers in the candidate gene and
asthma.
Table 6

           Allelic frequencies (%)
Markers    Asthmatics           Controls             Frequency diff.   P value
           (298 individuals)    (373 individuals)
32/357     A 38.8               A 20.8               8.8               7.34x10-4
33/234     A 49                 A 44.3               4.7               8.36x10-2
33/327     T 70.5               T 74.6               3.9               1.0x10-1
35/358     G 72.3               G 66.9               5.4               3.50x10-2
35/390     T 30.4               T 20.3               10.1              2.33x10-5




As shown in Table 6, markers 32/357 and 35/390 presented a strong association with asthma, this association being
highly significant (p value = 7.34x10-4 for marker 32/357 and 2.33x10-5 for marker 35/390).

Three markers showed moderate association when tested independently, namely 33/234, 33/327, 35/358. 
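
A single-marker comparison of this kind can be sketched as a chi-square test on a 2x2 table of allele counts. The document does not state which test produced the p values of Table 6, so the Pearson chi-square without continuity correction used below is an assumption, as are the function names and the conversion of percentages into counts over two chromosomes per individual.

```python
import math


def allele_counts(n_individuals, allele_freq_percent):
    """Convert an allele frequency (%) into (allele, other) chromosome counts,
    assuming two chromosomes per individual."""
    chromosomes = 2 * n_individuals
    carriers = round(chromosomes * allele_freq_percent / 100.0)
    return carriers, chromosomes - carriers


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    den = (a + b) * (c + d) * (a + c) * (b + d)
    return n * (a * d - b * c) ** 2 / den


def p_value_1df(chi2):
    """Upper-tail probability of a chi-square statistic with one degree of freedom."""
    return math.erfc(math.sqrt(chi2 / 2.0))


# Marker 35/390, allele T: 30.4 % in 298 asthmatics, 20.3 % in 373 controls
case_t, case_other = allele_counts(298, 30.4)
ctrl_t, ctrl_other = allele_counts(373, 20.3)
chi2 = chi_square_2x2(case_t, case_other, ctrl_t, ctrl_other)
p = p_value_1df(chi2)
```

With the rounded frequencies of Table 6 this gives a p value of the same order of magnitude as the one reported for marker 35/390; an exact match is not expected, since the published figure was computed from the underlying genotype counts.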

It is worth mentioning that allelic frequencies for each of the biallelic markers of Table 6 were separately
measured within the French control population (279 individuals) and the American control population (94 individuals).
The differences in allele frequencies between the two populations were between 1% and 7%, with p-values above 10-1.
These data confirmed that the combined French/American control population (373 individuals) was homogeneous enough
to be used as a control population for the present association study.

Example 31 

Association studies: Haplotype frequency analysis 
As already shown, one way of increasing the statistical power of individual markers is by performing
haplotype association analysis. A haplotype analysis for association of markers in the candidate gene and asthma was
performed by estimating the frequencies of all possible haplotypes for biallelic markers 32/357, 33/234, 33/327, 35/358
and 35/390 in the asthmatic and control populations described in Example 30 (Table 6), and comparing those frequencies
by means of a chi square statistical test (one degree of freedom). Haplotype estimations were performed by applying the
Expectation-Maximization (EM) algorithm (Excoffier L & Slatkin M, 1995, Mol. Biol. Evol. 12:921-927), using the
EM-HAPLO program (Hawley ME, Pakstis AJ & Kidd KK, 1994, Am. J. Phys. Anthropol. 18:104).
The results of such haplotype analysis are shown in Table 7.
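
The principle of Expectation-Maximization haplotype estimation can be illustrated for the simplest case of two biallelic markers. The sketch below is not the EM-HAPLO program; the function name `em_two_marker`, the coding of each genotype as the number of copies (0, 1 or 2) of one allele, and the fixed iteration count are assumptions chosen for illustration. Only double heterozygotes have ambiguous phase, so the expectation step splits them between the two possible haplotype pairs in proportion to the current frequency estimates.

```python
from collections import Counter


def em_two_marker(genotypes, iterations=100):
    """Estimate the four two-marker haplotype frequencies by EM.

    genotypes: list of (g1, g2), the copy number (0, 1, 2) of allele "1"
    at marker 1 and marker 2 for each individual.
    Returns a dict mapping haplotype (a1, a2) to its estimated frequency.
    """
    if not genotypes:
        raise ValueError("at least one genotype is required")
    haps = [(0, 0), (0, 1), (1, 0), (1, 1)]
    n_chrom = 2 * len(genotypes)
    fixed = Counter()      # haplotypes whose phase is unambiguous
    n_double_het = 0       # individuals heterozygous at both markers
    for g1, g2 in genotypes:
        if g1 == 1 and g2 == 1:
            n_double_het += 1
            continue
        a = [g1 // 2, g1 - g1 // 2]   # alleles at marker 1 on the two chromosomes
        b = [g2 // 2, g2 - g2 // 2]   # alleles at marker 2 on the two chromosomes
        fixed[(a[0], b[0])] += 1
        fixed[(a[1], b[1])] += 1

    freq = {h: 0.25 for h in haps}    # uniform starting point
    for _ in range(iterations):
        # E-step: probability that a double heterozygote is 00/11 rather than 01/10
        cis = freq[(0, 0)] * freq[(1, 1)]
        trans = freq[(0, 1)] * freq[(1, 0)]
        share = cis / (cis + trans) if (cis + trans) > 0 else 0.5
        expected = Counter(fixed)
        expected[(0, 0)] += n_double_het * share
        expected[(1, 1)] += n_double_het * share
        expected[(0, 1)] += n_double_het * (1 - share)
        expected[(1, 0)] += n_double_het * (1 - share)
        # M-step: re-estimate frequencies from the expected counts
        freq = {h: expected[h] / n_chrom for h in haps}
    return freq
```

A production implementation would handle more than two markers, missing genotypes and a convergence criterion; the fixed number of iterations is used here only to keep the example short.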
Table 7

Markers           32/357      33/234      33/327      35/358      35/390      Haplotype frequencies   Odds ratio   P value
Frequency diff.   8.8         4.7         3.9         5.4         10.1        Asthm.     Controls
P value           7.34x10-4   8.36x10-2   1.0x10-1    3.53x10-2   2.33x10-5
Haplotype 1       A                                               T           0.2        0.11         2.02         8.47x10-6
Haplotype 2                   A           T           G                       0.27       0.18         1.68         2.81x10-4
Haplotype 3       A           A           T           G           T           0.18       0.09         2.22         3.95x10-5




A two-marker haplotype covering markers 32/357 and 35/390 (haplotype 1, AT alleles respectively) presented
a p value of 8.47x10-6, an odds ratio of 2.02 and haplotype frequencies of 0.2 for asthmatic and 0.11 for control
populations respectively.

A three-marker haplotype covering markers 33/234, 33/327 and 35/358 (haplotype 2, ATG alleles respectively)
presented a p value of 2.81x10-4, an odds ratio of 1.68 and haplotype frequencies of 0.27 for asthmatic and 0.18 for
control populations respectively.

A five-marker haplotype covering markers 32/357, 33/234, 33/327, 35/358 and 35/390 (haplotype 3, AATGT
alleles respectively) presented a p value of 3.95x10-5, an odds ratio of 2.22 and haplotype frequencies of 0.18 for
asthmatic and 0.09 for control populations respectively.
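
The odds ratios quoted for the three haplotypes can be checked from the haplotype frequencies alone. The sketch below assumes the usual definition of an odds ratio for a haplotype (odds of carrying it on a chromosome from the asthmatic group divided by the same odds in the control group); the document does not spell out the formula, and the function name is hypothetical.

```python
def odds_ratio(freq_cases, freq_controls):
    """Odds of the haplotype in cases divided by its odds in controls."""
    odds_cases = freq_cases / (1.0 - freq_cases)
    odds_controls = freq_controls / (1.0 - freq_controls)
    return odds_cases / odds_controls


# (asthmatic frequency, control frequency) for haplotypes 1, 2 and 3
pairs = [(0.20, 0.11), (0.27, 0.18), (0.18, 0.09)]
ratios = [round(odds_ratio(cases, controls), 2) for cases, controls in pairs]
```

Under this definition the frequency pairs 0.20/0.11, 0.27/0.18 and 0.18/0.09 reproduce the reported odds ratios of 2.02, 1.68 and 2.22.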

Haplotype association analysis thus increased the statistical power of the individual marker association
studies when compared to single-marker analysis (from p values between 10-1 and 2x10-5 for the individual markers to p
values between 3x10-4 and 8x10-6 for the three-marker haplotype, haplotype 2).

The significance of the values obtained for the haplotype association analysis was evaluated by the following
computer simulation test. The genotype data from the asthmatic and control individuals were pooled and randomly
allocated to two groups which contained the same number of individuals as the trait positive and trait negative groups
used to produce the data summarized in Table 7. A haplotype analysis was then run on these artificial groups for the
three haplotypes presented in Table 7. This experiment was reiterated 1000 times and the results are shown in Table 8.
Table 8

                                     Permutation Test
Haplotype              Chi-Square    Average Chi-Square    Maximal Chi-Square    P value
Haplotype 1 (A--T)     19.70         1.2                   11.6                  1.0x10-3
Haplotype 2 (-ATG-)    13.49         1.2                   10.5                  1.0x10-3
Haplotype 3 (AATGT)    16.06         1.2                   9.3                   1.0x10-3




The results in Table 8 show that among 1000 iterations only 1‰ of the obtained haplotypes has a p value
comparable to the one obtained in Table 7.

These results clearly validate the statistical significance of the haplotypes obtained (haplotypes 1, 2 and 3,
Table 7).
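
The permutation procedure described above can be sketched as follows. This is a simplified illustration and not the program used to produce Table 8: instead of re-estimating haplotype frequencies by EM in every artificial group, it assumes that each individual is already summarized by the number of copies (0, 1 or 2) of the haplotype of interest. The function names, the fixed random seed and the "+1" convention in the empirical p value are likewise assumptions.

```python
import random


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square for the table [[a, b], [c, d]]; 0.0 if a margin is empty."""
    den = (a + b) * (c + d) * (a + c) * (b + d)
    if den == 0:
        return 0.0
    return (a + b + c + d) * (a * d - b * c) ** 2 / den


def group_statistic(cases, controls):
    """Chi-square comparing haplotype counts between two groups of individuals."""
    a = sum(cases)
    b = 2 * len(cases) - a           # chromosomes not carrying the haplotype
    c = sum(controls)
    d = 2 * len(controls) - c
    return chi_square_2x2(a, b, c, d)


def permutation_test(cases, controls, iterations=1000, seed=0):
    """Pool individuals, reallocate them at random to two groups of the
    original sizes, and compare the permuted statistics with the observed one."""
    rng = random.Random(seed)
    observed = group_statistic(cases, controls)
    pooled = list(cases) + list(controls)
    n_cases = len(cases)
    stats = []
    for _ in range(iterations):
        rng.shuffle(pooled)
        stats.append(group_statistic(pooled[:n_cases], pooled[n_cases:]))
    exceed = sum(s >= observed for s in stats)
    return {
        "observed": observed,
        "average": sum(stats) / iterations,
        "maximal": max(stats),
        "empirical_p": (exceed + 1) / (iterations + 1),
    }
```

The returned average and maximal chi-square values correspond to the columns of Table 8; when no permuted statistic reaches the observed one, the empirical p value is bounded by roughly one over the number of iterations.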

While Examples 15-31 illustrate the use of the maps and markers of the present invention for identifying a new
gene associated with a complex disease within a 2Mb genomic region and for establishing that a candidate gene is, at least
partially, responsible for a disease, the maps and markers of the present invention may also be used to identify one or
more biallelic markers or one or more genes associated with other detectable phenotypes, including drug response, drug
toxicity, or drug efficacy. The biallelic markers used in such drug response analyses or shown, using the methods of the
present invention, to be associated with such traits, may lie within or near genes responsible for or partly responsible for
a particular disease, for example a disease against which the drug is meant to act, or may lie within genomic regions
which are not responsible for or partly responsible for a disease. For example, the genomic region harboring markers
associated with a particular drug response may carry a drug metabolism gene, or a gene encoding a protein with a role in
the drug response mechanism. Thus, biallelic markers within or near genes known to be involved in drug response,
toxicity, or efficacy or genes suspected of being involved in drug response, toxicity, or efficacy may be used to identify
individuals likely to respond positively or negatively to drug treatment. In the context of the present invention, a "positive
response" to a medicament can be defined as comprising a reduction of the symptoms related to the disease or condition
to be treated. In the context of the present invention, a "negative response" to a medicament can be defined as
comprising either a lack of positive response to the medicament which does not lead to a symptom reduction or to a
side-effect observed following administration of the medicament.

Drug efficacy, response and tolerance/toxicity can be considered as multifactorial traits involving a genetic
component in the same way as complex diseases such as Alzheimer's disease, prostate cancer, hypertension or diabetes.
As such, the identification of genes involved in drug efficacy and toxicity could be achieved following a positional cloning
approach, e.g. performing linkage analysis within families in order to obtain the subchromosomal location of the gene(s).
However, this type of analysis is actually impractical in the case of drug responsiveness, due to the lack of availability of
familial cases. In fact, the likelihood of having more than one individual in a particular family being exposed to the same
drug at the same time is very low. Therefore, drug efficacy and toxicity can only be analyzed as sporadic traits.

In order to conduct association studies to analyze the individual response to a given drug in groups of patients
affected with a disease, up to four groups are screened to determine their patterns of biallelic markers using the
techniques described above. The four groups are:
• Non-diseased or random controls,
• Diseased patients/drug responders,
• Diseased patients/drug non-responders,
• Diseased patients/drug side effects.

In preferred embodiments, the above mentioned groups are recruited according to phenotyping criteria having
the characteristics described above, so that the phenotypes defining the different groups are non-overlapping, preferably
extreme phenotypes.

In highly preferred embodiments, such phenotyping criteria have the bimodal distribution described above.
The final number and composition of the groups for each drug association study is adapted
to the distribution of the above described phenotypes within the studied population.



After selecting a suitable population, association and haplotype analyses may be performed as
described herein to identify one or more biallelic markers associated with drug response, preferably drug toxicity or drug
efficacy. The identification of such one or more biallelic markers allows one to conduct diagnostic tests to determine
whether the administration of a drug to an individual will result in drug response, preferably drug toxicity, or drug
efficacy.

The methods described above for identifying a gene associated with prostate cancer and biallelic markers
indicative of a risk of suffering from asthma may be utilized to identify genes associated with other detectable
phenotypes. In particular, the above methods may be used with any marker or combination of markers included in the
maps of the present invention, including the 653 biallelic markers obtained above (which include the sequences of SEQ
ID Nos. 1-50 and 51-100 or the sequences complementary thereto), the PG1 markers, the asthma-associated markers,
and the Apo E markers of SEQ ID Nos. 301-305/307-311 or the sequences complementary thereto. As described above,
the general strategy to perform the association studies using the maps and markers of the present invention is to scan
two groups of individuals (trait positive individuals and trait negative controls) characterized by a well defined phenotype
in order to measure the allele frequencies of the biallelic markers in each of these groups. Preferably, the frequencies of
markers with inter-marker spacing of about 150 kb are determined in each group. More preferably, the frequencies of
markers with inter-marker spacing of about 75 kb are determined in each group. Even more preferably, markers with
inter-marker spacing of about 50 kb, about 37.5kb, about 30kb, or about 25kb will be tested in each population. For
genome-wide studies, it will be preferred to measure the frequencies of about 20,000, or about 40,000 biallelic markers
in each group. In a highly preferred embodiment, the frequencies of about 60,000, about 80,000, about 100,000, or
about 120,000 biallelic markers are determined in each group. In some embodiments, haplotype analyses may be run
using groups of markers located within regions spanning less than 1kb, from 1 to 5kb, from 5 to 10kb, from 10 to 25kb,
from 25 to 50kb, from 50 to 150kb, from 150 to 250kb, from 250 to 500kb, from 500kb to 1Mb, or more than 1Mb.

Allele frequency can be measured using microsequencing techniques described herein; preferred high
throughput microsequencing procedures are further exemplified below; it will be further appreciated that any other large
scale genotyping method suitable with the intended purpose contemplated herein may also be used.

In some embodiments of the present invention, a computer-based system may support the on-line coordination
between the identification of biallelic markers and the corresponding analysis of their frequency in the different groups.

It will be appreciated that it is not necessary to use a full high density biallelic marker map in order to start a
genome-wide association study. It is sufficient to generate and use a first set of about 20,000 markers (one marker per
BAC, average inter-marker spacing of about 150kb). Maps having higher densities of biallelic markers (two or more
markers per BAC, average inter-marker spacing of about 75kb or less) may then be generated by starting first on those
BACs for which a candidate association has been established at the first step.
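
The relationship between inter-marker spacing and the number of markers in a genome-wide set is simple division, sketched below. The genome size of about 3,000 Mb is an assumption introduced here (it is not stated in this passage), as is the function name; with that assumption the spacings of 150, 75, 50, 37.5, 30 and 25 kb correspond to the marker numbers of about 20,000 to 120,000 mentioned above.

```python
GENOME_SIZE_KB = 3_000_000  # assumption: a human genome of about 3,000 Mb


def markers_needed(spacing_kb, genome_size_kb=GENOME_SIZE_KB):
    """Approximate number of evenly spaced markers for a given average spacing."""
    return round(genome_size_kb / spacing_kb)


spacings_kb = (150, 75, 50, 37.5, 30, 25)
marker_counts = [markers_needed(s) for s in spacings_kb]
```

Real maps are not evenly spaced, so these figures are averages rather than guarantees of local coverage.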

In cases when one or more candidate regions have previously been delineated, such as cases where a particular
gene or genomic region is suspected of being associated with a trait, local excerpts of biallelic marker maps having
densities above one marker per 150kb may be exploited using BACs harboring said genomic regions, or genes, or portions
thereof. In these cases also, successive association studies may be performed using sets of biallelic markers showing
increasing densities, preferably from about one every 150 kb to about one every 75kb; more preferably, sets of markers
with inter-marker spacing below about 50kb, below about 37.5kb, below about 30kb, most preferably below about 25
kb, will be used.

Haplotype analyses may also be conducted using groups of biallelic markers within the candidate region. The
biallelic markers included in each of these groups may be located within a genomic region spanning less than 1kb, from 1
to 5kb, from 5 to 10kb, from 10 to 25kb, from 25 to 50kb, from 50 to 150kb, from 150 to 250kb, from 250 to 500kb,
from 500kb to 1Mb, or more than 1Mb. It will be appreciated that the ordered DNA fragments containing these groups of
biallelic markers need not completely cover the genomic regions of these lengths but may instead be incomplete contigs
having one or more gaps therein. As discussed in further detail below, biallelic markers may be used in association studies
and haplotype analyses regardless of the completeness of the corresponding physical contig harboring them, provided linkage
disequilibrium between the markers can be assessed.
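
As an illustration of how linkage disequilibrium between two biallelic markers might be assessed, the sketch below computes the commonly used pairwise measures D, |D'| and r^2 from haplotype and allele frequencies. These standard population-genetics measures are used here as an assumption; this passage does not specify which measure is applied, and the function and parameter names are hypothetical.

```python
def linkage_disequilibrium(p_ab: float, p_a: float, p_b: float):
    """Return (D, |D'|, r^2) for two biallelic markers.

    p_ab: frequency of the haplotype carrying allele A at the first marker
          and allele B at the second marker
    p_a:  frequency of allele A at the first marker
    p_b:  frequency of allele B at the second marker
    """
    d = p_ab - p_a * p_b  # deviation from independent assortment
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = abs(d) / d_max if d_max > 0 else 0.0
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    r_squared = d * d / denom if denom > 0 else 0.0
    return d, d_prime, r_squared


# Two markers whose alleles always travel together (complete disequilibrium).
print(linkage_disequilibrium(p_ab=0.5, p_a=0.5, p_b=0.5))
# Two markers whose alleles assort independently (no disequilibrium).
print(linkage_disequilibrium(p_ab=0.25, p_a=0.5, p_b=0.5))
```

Because the calculation needs only frequencies at the two markers, it does not depend on the contig between them being complete, which is consistent with the point made above about gapped contigs.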

As described above, if a positive association with a trait, such as a disease, or a drug efficacy and/or toxicity, 
is identified using the biallelic markers and maps of the present invention, the maps will provide not only the
confirmation of the association, but also a shortcut towards the identification of the gene involved in the trait under 
study. As described above, since the markers showing positive association to the trait are in linkage disequilibrium with 
the trait loci, the causal gene will be physically located in the vicinity of these markers. Regions identified through 
association studies using high density maps will on average have a 20 - 40 times shorter length than those identified by 
linkage analysis (2 to 20 Mb).

As described above, once a positive association is confirmed with the high density biallelic marker maps of the
present invention, BACs from which the most highly associated markers were derived are completely sequenced and the
mutations in the causal gene are searched by applying genomic analysis tools. As described above, once a region
harboring a gene associated with a detectable trait has been sequenced and analyzed, the candidate functional regions
(e.g. exons and splice sites, promoters and other regulatory regions) are scanned for mutations by comparing the
sequences of a selected number of controls and cases, using adequate software.

In some embodiments, trait positive samples being compared to identify causal mutations are selected among 
those carrying the ancestral haplotype; in these embodiments, control samples are chosen from individuals not carrying 
said ancestral haplotype. 

In further embodiments, trait positive samples being compared to identify causal mutations are selected among
those showing haplotypes that are as close as possible to the ancestral haplotype; in these embodiments, control
samples are chosen from individuals not carrying any of the haplotypes selected for the case population.

The mutation detection procedure is essentially similar to that used for biallelic site identification. A pair of
oligonucleotide primers is designed in order to amplify the sequences to be tested. In preferred embodiments, priority is
given to the testing of functional sequences; in such embodiments, sequences covering every exon/promoter predicted
region, preferably including potential splice sites, are determined and compared between the T+ and T- populations.
Amplification is carried out on DNA samples from T+ and T- individuals using the polymerase chain reaction under the
above described conditions. To be sequenced, amplification products from genomic PCR may be subjected to automated
dideoxy terminator sequencing reactions and electrophoresed on ABI 377 sequencers. Following gel image analysis and
DNA sequence extraction, ABI sequence data are automatically analyzed to detect the presence of sequence variations
among T+ and T- individuals. Sequences are preferably verified by comparing the sequences of both DNA strands of
each individual.
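
The two checks described in this paragraph, detecting sequence variations among individuals and verifying a read against the opposite strand, can be sketched as follows. This is a simplified illustration that assumes the reads are already aligned, of equal length, and free of ambiguity codes; it is not the automated analysis software referred to above, and all names are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Reverse complement of an unambiguous DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def strands_agree(forward_read: str, reverse_read: str) -> bool:
    """True when the reverse-strand read confirms the forward-strand read."""
    return forward_read == reverse_complement(reverse_read)


def variable_positions(sequences: list) -> list:
    """0-based positions at which aligned, equal-length sequences differ."""
    if not sequences:
        return []
    length = len(sequences[0])
    if any(len(s) != length for s in sequences):
        raise ValueError("sequences must be aligned to the same length")
    return [i for i in range(length) if len({s[i] for s in sequences}) > 1]


# Hypothetical aligned reads from trait positive (T+) and trait negative (T-) individuals.
t_plus = ["ACGTTAGC", "ACGTTAGC"]
t_minus = ["ACGTCAGC", "ACGTTAGC"]
print(variable_positions(t_plus + t_minus))
print(strands_agree("AACG", "CGTT"))
```

Positions reported by `variable_positions` are only candidates; as the text goes on to explain, they still have to be verified in a larger population of cases and controls.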

It is preferred that candidate polymorphisms be then verified by screening a larger population of cases and
controls by means of any genotyping procedure such as those described herein, preferably using a microsequencing
technique in an individual test format. Polymorphisms are considered as candidate mutations when present in cases and
controls at frequencies compatible with the expected association results.
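
One way to compare allele frequencies between cases and controls is a 2 x 2 chi-square test on allele counts. The sketch below uses the Pearson chi-square statistic with one degree of freedom as an assumed, commonly used choice; this passage does not prescribe a particular statistical test, and the counts shown are invented for illustration.

```python
import math


def allele_chi_square(case_a: int, case_b: int, control_a: int, control_b: int):
    """Pearson chi-square (1 degree of freedom) on a 2 x 2 table of allele counts.

    Returns (chi2, p_value). Rows are cases and controls; columns are the two
    alleles of a biallelic marker.
    """
    n = case_a + case_b + control_a + control_b
    row_cases, row_controls = case_a + case_b, control_a + control_b
    col_a, col_b = case_a + control_a, case_b + control_b
    denom = row_cases * row_controls * col_a * col_b
    if denom == 0:
        raise ValueError("every row and column must contain at least one allele")
    chi2 = n * (case_a * control_b - case_b * control_a) ** 2 / denom
    # For 1 degree of freedom the upper tail probability is erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


# Hypothetical counts: allele A on 60 of 100 case chromosomes and 40 of 100 control chromosomes.
chi2, p = allele_chi_square(60, 40, 40, 60)
print(f"chi2 = {chi2:.2f}, p = {p:.4f}")
```

A polymorphism whose case and control frequencies give a small p value here would be a candidate worth following up; identical frequencies give a statistic of zero and a p value of one.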

The maps and biallelic markers of the present invention may also be used to identify patterns of biallelic 
markers associated with detectable traits resulting from polygenic interactions. The analysis of genetic interaction 
between alleles at unlinked loci requires individual genotyping using the techniques described herein. The analysis of 
allelic interaction among a selected set of biallelic markers with appropriate p-values can be considered as a haplotype
analysis, similar to those described in further detail within the present invention.

Use of Biallelic Markers to Identify Individuals Likely to Exhibit a Detectable
Trait Associated with a Particular Allele of a Known Gene
In addition to their utility in searches for genes associated with detectable traits on a genome-wide, chromosome-wide,
or subchromosomal level, the maps and biallelic markers of the present invention may be used in more targeted
approaches for identifying individuals likely to exhibit a particular detectable trait or individuals who exhibit a particular
detectable trait as a consequence of possessing a particular allele of a gene associated with the detectable trait. For
example, the biallelic markers and maps of the present invention may be used to identify individuals who carry an allele of a
known gene that is suspected of being associated with a particular detectable trait. In particular, the target genes may be
genes having alleles which predispose an individual to suffer from a specific disease state. In other cases, the target genes
may be genes having alleles that predispose an individual to exhibit a desired or undesired response to a drug or other
pharmaceutical composition, a food, or any administered compound. The known gene may encode any of a variety of types
of biomolecules. For example, the known genes targeted in such analyses may be genes known to be involved in a particular
step in a metabolic pathway in which disruptions may cause a detectable trait. Alternatively, the target genes may be genes
encoding receptors or ligands which bind to receptors in which disruptions may cause a detectable trait, genes encoding
transporters, genes encoding proteins with signaling activities, genes encoding proteins involved in the immune response,
genes encoding proteins involved in hematopoiesis, or genes encoding proteins involved in wound healing. It will be
appreciated that the target genes are not limited to those specifically enumerated above, but may be any gene known to
be or suspected of being associated with a detectable trait.

As previously mentioned, the maps and markers of the present invention may be used to identify genes
associated with drug response. Accordingly, the present invention comprises a method of using a drug comprising
obtaining a nucleic acid sample from an individual, determining the identity of the polymorphic base of one or more
biallelic markers obtained by the methods described above which is or are associated with a positive response to
treatment with the drug or one or more biallelic markers obtained by the methods described above which is or are
associated with a negative response to treatment with the drug, and administering the drug to the individual if the
nucleic acid sample contains one or more alleles of biallelic markers associated with a positive response to treatment
with the drug or if said nucleic acid sample lacks one or more alleles of biallelic markers associated with a negative
response to the drug. In some embodiments of the method, the administering step comprises administering the drug to
the individual if the nucleic acid sample contains one or more alleles of biallelic markers associated with a positive
response to treatment with the drug and the nucleic acid sample lacks one or more alleles of biallelic markers associated
with a negative response to the drug.
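
The administering step described above amounts to a simple decision rule, sketched below. The allele identifiers are hypothetical labels, and the phrase "lacks one or more alleles" is read literally here as "at least one negative-response allele is absent from the sample", which is an interpretive assumption.

```python
def should_administer(sample_alleles: set, positive_alleles: set,
                      negative_alleles: set, require_both: bool = False) -> bool:
    """Decide whether the drug would be administered under the described rule.

    require_both=False: administer if a positive-response allele is present
                        OR a negative-response allele is lacking.
    require_both=True:  administer only if both conditions hold
                        (the narrower embodiment).
    """
    has_positive = bool(sample_alleles & positive_alleles)
    lacks_negative = bool(negative_alleles - sample_alleles)
    if require_both:
        return has_positive and lacks_negative
    return has_positive or lacks_negative


# Hypothetical allele labels of the form "marker:base".
positive = {"marker1:A"}
negative = {"marker2:T"}
print(should_administer({"marker1:A", "marker2:T"}, positive, negative))
print(should_administer({"marker1:A", "marker2:T"}, positive, negative, require_both=True))
```

The `require_both` flag distinguishes the broader method from the embodiment in which both conditions must be satisfied.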



The biallelic markers of the present invention may also be used to select individuals for inclusion in
the clinical trials of a drug. By selecting individuals who are likely to respond favorably to a drug for inclusion in the
trial, the effectiveness of the drug can be assessed without lowering the measured effectiveness as a result of including
non-responders or negative responders in the clinical trial. Perhaps more importantly, using such selection may avoid
including patients who may suffer from undesirable side effects if administered the drug under trial, thus increasing the
safety of clinical trials. Accordingly, the present invention also includes a method of selecting an individual for inclusion
in a clinical trial of a drug comprising obtaining a nucleic acid sample from an individual, determining the identity of the
polymorphic base of one or more biallelic markers obtained by the methods described above which is or are associated
with a positive response to treatment with the drug or one or more biallelic markers associated with a negative response
to treatment with the drug in the nucleic acid sample, and including the individual in the clinical trial if the nucleic acid
sample contains one or more alleles of biallelic markers obtained by the methods described above which is or are
associated with a positive response to treatment with said drug or if the nucleic acid sample lacks one or more alleles of
biallelic markers associated with a negative response to the drug. In one embodiment of the method, the inclusion step
comprises including the individual in the clinical trial if the nucleic acid sample contains one or more alleles of biallelic
markers associated with a positive response to treatment with the drug and the nucleic acid sample lacks one or more
alleles of biallelic markers associated with a negative response to the drug.

In particular embodiments, one or several of the ApoE linked markers of SEQ ID Nos. 301-305/307-311 or the
sequences complementary thereto may be used in targeted approaches to identify individuals who are likely to develop
Alzheimer's disease, or to identify individuals who do suffer from Alzheimer's disease. In other embodiments, one or more of
the markers of SEQ ID Nos. 306 and 312 and one or more of the ApoE linked markers of SEQ ID Nos. 301-305/307-311
or the sequences complementary thereto are genotyped in targeted approaches to identify individuals who are likely to develop
Alzheimer's disease, or to identify individuals who do suffer from Alzheimer's disease. In further embodiments, one or several
of the PG1 linked markers may be tested in targeted approaches to identify individuals who are likely to develop prostate
cancer, or to identify individuals who do suffer from prostate cancer. Finally, individuals likely to be asthmatic, or asthmatic
individuals, can be identified using one or more of the asthma-associated markers to conduct the procedures of the present
invention.

Given the high number of cancer types in which the PG1 chromosomal region is involved, it will be appreciated that
the PG1 markers may be employed to identify individuals at risk of developing cancers other than prostate cancer, or to
identify individuals suffering from cancers other than prostate cancer. It will be further appreciated that the asthma-associated
markers may be tested to identify individuals likely to exhibit or exhibiting inflammatory traits other than the
asthmatic state (e.g. arthritis, or psoriasis, among others). The present invention provides adequate methods to establish
associations between markers, such as those mentioned above, and candidate traits expressly contemplated herein, thus
legitimating the corresponding targeted approaches to identify individuals likely to exhibit, or exhibiting, said candidate traits.

In some embodiments, the 653 biallelic markers obtained above (which include the sequences of SEQ ID Nos.
1-50 and 51-100 or the sequences complementary thereto) may be used in targeted approaches to identify individuals at
risk of developing a detectable trait, for example a complex disease or desired/undesired drug response, or to identify
individuals exhibiting said trait. The present invention provides methods to establish putative associations between any of
the biallelic markers described herein and any detectable traits, including those specifically described herein.

To use the maps and markers of the present invention in further targeted approaches, biallelic markers which are
in linkage disequilibrium with any of the above disclosed markers may be identified. In cases where one or more biallelic
markers of the present invention have been shown to be associated with a detectable trait, more biallelic markers in linkage
disequilibrium with said associated biallelic markers may be generated and used to perform targeted approaches aiming at
identifying individuals exhibiting, or likely to exhibit, said detectable trait, according to the methods provided herein.

Furthermore, in cases where a candidate gene is suspected of being associated with a particular detectable trait or
suspected of causing the detectable trait, biallelic markers in linkage disequilibrium with said candidate gene may be
identified and used in targeted approaches, such as the approaches utilized above for the asthma-associated gene and the
Apo E gene.

Biallelic markers that are in linkage disequilibrium with markers associated with a detectable trait, or with genes
associated with a detectable trait, or suspected of being so, are identified by performing single marker analyses, haplotype
association analyses, or linkage disequilibrium measurements on samples from trait positive and trait negative individuals as
described above using biallelic markers lying in the vicinity of the target marker or gene. In this manner, a single biallelic
marker or a group of biallelic markers may be identified which indicate that an individual is likely to possess the detectable
trait or does possess the detectable trait as a consequence of a particular allele of the target marker or gene.

Nucleic acid samples from individuals to be tested for predisposition to a detectable trait or possession of a
detectable trait as a consequence of a particular allele of the target gene may be examined using the diagnostic methods
described below.

Diagnostic Methods

To use the maps and biallelic markers of the present invention to diagnose whether an individual is predisposed to
express a detectable trait or whether the individual expresses a detectable trait as a result of a particular mutation, one or
more biallelic markers indicative of such a predisposition or causative mutation are identified by performing association
studies and haplotype analysis on affected and non-affected individuals as described above.

The diagnostic techniques of the present invention may employ a variety of methodologies to determine
whether a test subject has a biallelic marker pattern associated with an increased risk of developing a detectable trait or
whether the individual suffers from a detectable trait as a result of a particular mutation, including methods which
enable the analysis of individual chromosomes for haplotyping, such as family studies, single sperm DNA analysis or
somatic hybrids.

The trait analyzed using the present diagnostics may be any detectable trait, including diseases, drug response,
drug efficacy, or drug toxicity. A "positive" drug response may refer to a response indicating either some drug efficacy
or no drug toxicity. Diagnostics which analyze drug response, drug efficacy, or drug toxicity may be used to determine
whether an individual should be treated with a particular drug. For example, if the diagnostic indicates a likelihood that
an individual will respond positively to treatment with a particular drug, the drug may be administered to the individual.
Conversely, if the diagnostic indicates that an individual is likely to respond negatively to treatment with a particular
drug, an alternative course of treatment may be prescribed. A negative response may be defined as either the absence
of an efficacious response or the presence of toxic side effects.

Clinical drug trials represent another application for the maps and markers of the present invention. One or
more markers indicative of drug response, drug efficacy, or drug toxicity may be identified using the techniques
described above. Thereafter, potential participants in clinical trials of the drug may be screened to identify those
individuals most likely to respond favorably to the drug and exclude those likely to experience side effects. In that way,
the effectiveness of drug treatment may be measured in individuals who respond positively to the drug, without lowering
the measurement as a result of the inclusion of individuals who are unlikely to respond positively in the study and
without risking undesirable safety problems.

In each of the diagnostic methods, a nucleic acid sample is obtained from the test subject and the biallelic
marker pattern is determined for one or more of the biallelic markers included in the maps of the present invention,
including the 653 biallelic markers obtained above (which include the sequences of SEQ ID Nos. 1-50 and 51-100 or the
sequences complementary thereto), the asthma-associated biallelic markers, the PG1 biallelic markers, and the Apo E biallelic
markers, including those of SEQ ID Nos. 301-305/307-311 or the sequences complementary thereto. In other
embodiments, the biallelic marker pattern of one or more of the markers of SEQ ID Nos. 306 and 312 is determined in
addition to determining the biallelic marker pattern of one or more of the biallelic markers included in the maps of the
present invention, including the 653 biallelic markers obtained above (which include the sequences of SEQ ID Nos. 1-50
and 51-100 or the sequences complementary thereto), the asthma-associated biallelic markers, the PG1 biallelic
markers, and the Apo E biallelic markers, including those of SEQ ID Nos. 301-305/307-311 or the sequences
complementary thereto. In some embodiments, the biallelic marker pattern is determined by conducting an amplification
reaction to generate amplicons containing the polymorphic bases of the one or more biallelic markers to be genotyped.
The identities of the polymorphic bases of the one or more biallelic markers to be analyzed may be determined using a
variety of methods, including hybridization assays which specifically detect amplification products containing particular
alleles of the one or more biallelic markers, and microsequencing reactions which identify the polymorphic bases of the
one or more biallelic markers to be analyzed.
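
To make the microsequencing read-out concrete, the sketch below models it in silico as a single-base primer extension: a primer anneals immediately adjacent to the polymorphic site and the next base identifies the allele. This reflects the general understanding of microsequencing rather than the specific protocol of the present disclosure, it works on the strand having the same sense as the primer, and the sequences shown are invented for illustration.

```python
from typing import Optional


def microsequence(amplicon: str, primer: str) -> Optional[str]:
    """Return the base immediately 3' of the primer site, or None if not found.

    Simplifying assumptions: the amplicon is given on the strand having the
    same sense as the primer, the primer matches exactly, and it occurs once.
    """
    start = amplicon.find(primer)
    if start == -1:
        return None
    position = start + len(primer)
    return amplicon[position] if position < len(amplicon) else None


# Hypothetical amplicons that differ only at the polymorphic base after "ATCC".
allele_1 = "GGATCCATGCA"
allele_2 = "GGATCCGTGCA"
print(microsequence(allele_1, "ATCC"), microsequence(allele_2, "ATCC"))
```

In a heterozygous individual both bases would be observed at the polymorphic position, so a real assay reports a pair of bases rather than a single one.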

While the following discussion utilizes the 653 biallelic markers obtained above (which include the sequences
of SEQ ID Nos. 1-50 and 51-100 or the sequences complementary thereto), the asthma-associated biallelic markers, the
PG1 biallelic markers, and the Apo E biallelic markers as examples of the diagnostics of the present invention, it will be
appreciated that the same diagnostics may be used in conjunction with any marker or any group of markers included in
the maps of the present invention.

Examples of amplification primers enabling the amplification, from subjects' genomic DNA samples, of DNA
fragments that carry each of the markers of SEQ ID Nos. 1-50 and 51-100 or the sequences complementary thereto, are
oligonucleotides of SEQ ID Nos. 101-150 and 151-200; pairs of corresponding primers for a given biallelic marker may
be reconstituted by choosing the adequate upstream oligonucleotide from SEQ ID Nos. 101-150 together with the
corresponding downstream oligonucleotide from SEQ ID Nos. 151-200.

SEQ ID Nos. 1-50 correspond to the sequence identification numbers for a first allele of the biallelic markers of
SEQ ID Nos. 1-50 and 51-100, and SEQ ID Nos. 51-100 correspond to the sequence identification numbers for a second
allele of the biallelic markers of SEQ ID Nos. 1-50 and 51-100.
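
The numbering scheme described above can be captured in a small lookup. The sketch assumes that the listings are ordered consistently, so that the n-th marker has its first allele at SEQ ID No. n, its second allele at n + 50, its upstream primer at n + 100 and its downstream primer at n + 150; the text gives only the ranges, so these offsets are an assumption, and the function and field names are hypothetical.

```python
from typing import NamedTuple


class MarkerSeqIds(NamedTuple):
    first_allele: int
    second_allele: int
    upstream_primer: int
    downstream_primer: int


def seq_ids_for_marker(index: int) -> MarkerSeqIds:
    """SEQ ID numbers for the index-th listed marker (1 to 50).

    Assumption: all four ranges (1-50, 51-100, 101-150, 151-200) are listed
    in the same marker order, so corresponding entries differ by fixed offsets.
    """
    if not 1 <= index <= 50:
        raise ValueError("index must be between 1 and 50")
    return MarkerSeqIds(index, index + 50, index + 100, index + 150)


print(seq_ids_for_marker(1))
print(seq_ids_for_marker(50))
```

If the sequence listing itself records which primers belong to which marker, that record should be used in preference to fixed offsets.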

SEQ ID Nos. 313-318 correspond to sequence identification numbers of upstream amplification primers
that may be used to generate amplification products containing the polymorphic bases of the biallelic markers of
respective SEQ ID Nos. 301-306/307-312. SEQ ID Nos. 319-324 correspond to downstream amplification primers that
may be used to generate amplification products containing the polymorphic bases of the biallelic markers of respective
SEQ ID Nos. 301-306/307-312.

For all markers of SEQ ID Nos. 1-50/51-100 and 301-306/307-312 or the sequences complementary thereto,
the enclosed listings indicate the position and identity of the polymorphic base in each biallelic marker. Potential
microsequencing primers are also included in the sequence listing. The sequences of SEQ ID Nos. 201-250 may be used
in microsequencing procedures such as those described herein to determine the sequence of the polymorphic bases of the
biallelic markers of SEQ ID Nos. 1-50/51-100. The sequences of SEQ ID Nos. 325-330 or 331-336 may be used in
microsequencing procedures such as those described herein to determine the sequence of the polymorphic bases of the
biallelic markers of SEQ ID Nos. 301-306/307-312.

All listings indicate the internal identification number corresponding to the biallelic marker to which the listed sequence
is related.

One aspect of the present invention is a method for determining whether an individual is at risk of developing
Alzheimer's Disease or whether an individual suffers from Alzheimer's Disease as a consequence of possessing the Apo E
ε4 site A allele. The method involves obtaining a nucleic acid sample from the individual and determining whether the
nucleic acid sample contains one or more markers indicative of a risk of developing Alzheimer's Disease or one or more
markers indicative that the individual suffers from Alzheimer's Disease as a result of possessing the Apo E ε4 site A
allele. In one embodiment, the method comprises determining the identity of the polymorphic base of one or more
biallelic markers selected from the group consisting of SEQ ID Nos. 301-305/307-312 or the sequences complementary
thereto in the nucleic acid sample. In a further embodiment, the method involves determining whether the nucleic acid
sample contains the sequence of SEQ ID No. 306 (the C allele of marker 99-2452/54 containing the Apo E ε4 site A
allele) or the sequence complementary thereto. In a further embodiment the method comprises determining whether the
nucleic acid sample contains SEQ ID No. 311 (the T allele of marker 99-365/344) or the sequence complementary
thereto. In another embodiment, the method comprises determining whether the nucleic acid sample contains SEQ ID
No. 311 (the T allele of marker 99-365/344) and SEQ ID No. 306 (the C allele of marker 99-2452/54 containing the Apo
E ε4 site A allele) or the sequence complementary thereto.

In still a further embodiment, the method comprises determining whether the nucleic acid sample contains SEQ
ID Nos. 302, 301, 303, and 304 or the sequences complementary thereto. In still a further embodiment, the method
comprises determining whether the nucleic acid sample contains SEQ ID Nos. 302, 303, and 304 or the sequences
complementary thereto. In a further embodiment the method comprises determining whether the nucleic acid sample
contains SEQ ID No. 311 (the T allele of marker 99-365/344) or the sequence complementary thereto.

In some embodiments, the step of determining the identity of the polymorphic base of one or more biallelic
markers selected from the group consisting of SEQ ID Nos. 301-305 and SEQ ID Nos. 307-311 or the sequences
complementary thereto in the nucleic acid sample comprises conducting an amplification reaction on said nucleic acid
sample using one or more of the amplification primers selected from the group consisting of SEQ ID Nos. 313-317 and
SEQ ID Nos. 319-323 and determining the identity of the polymorphic base in said one or more biallelic markers.

In some embodiments, the identity of the polymorphic base may be determined using one or more of the
microsequencing primers listed as SEQ ID Nos. 325-329 or 331-335. In embodiments comprising the step of
determining whether the nucleic acid sample contains the sequence of SEQ ID No. 306, the method may comprise
conducting an amplification reaction on the nucleic acid sample using the pair of amplification primers consisting of SEQ
ID Nos. 318 and 324. In some embodiments, the step of determining whether the nucleic acid sample contains the
sequence of SEQ ID No. 306 comprises conducting a microsequencing reaction using one of the microsequencing primers
listed as SEQ ID Nos. 330 or 336.

Another aspect of the present invention relates to a method of determining whether an individual is at risk of
developing a trait or whether an individual expresses a trait as a consequence of possessing a particular trait-causing
allele. Alternatively, another aspect of the present invention relates to a method of determining whether an individual is
at risk of developing a plurality of traits or whether an individual expresses a plurality of traits as a result of possessing
particular trait-causing alleles. These methods involve obtaining a nucleic acid sample from the individual and
determining whether the nucleic acid sample contains one or more markers indicative of a risk of developing the trait or
one or more markers indicative that the individual expresses the trait as a result of possessing a particular trait-causing
allele. In one embodiment, the methods comprise determining the identity of the polymorphic base of one or more
biallelic markers in the maps of the present invention, including any of the 653 biallelic markers obtained above (which
include the sequences of SEQ ID Nos. 1-50 and 51-100 or the sequences complementary thereto), the asthma-associated
biallelic markers, the PG1 biallelic markers, and the new Apo E biallelic markers. In a further embodiment, the methods
comprise determining the identities of the polymorphic bases of at least two, at least three, at least five, at least eight,
at least 20, at least 100, at least 200, at least 300, at least 400, between 400 and 2,000, between 2,000 and 4,000,
between 4,000 and 10,000, between 10,000 and 20,000 or more than 20,000 of the biallelic markers in the maps of
the present invention, including any of the 653 biallelic markers obtained above (which include the sequences of SEQ ID
Nos. 1-50 and 51-100 or the sequences complementary thereto), the asthma-associated biallelic markers, the PG1
biallelic markers, and the new Apo E biallelic markers.

In some embodiments, the step of determining the identity of the polymorphic base of one or more biallelic
markers in the maps of the present invention, including any of the 653 biallelic markers obtained above (which include
the sequences of SEQ ID Nos. 1-50 and 51-100 or the sequences complementary thereto), the asthma-associated
biallelic markers, the PG1 biallelic markers, and the new Apo E biallelic markers, comprises conducting an amplification
reaction on said nucleic acid sample using appropriate amplification primers and determining the identity of the
polymorphic base in said one or more biallelic markers. In some embodiments, the identity of the polymorphic base may
be determined using appropriate microsequencing primers.

As described herein, the diagnostics may be based on a single biallelic marker or a group of biallelic markers.
Without wishing to be limited to any particular value, it is preferred that the biallelic marker used in single marker
diagnostics, either as a positive basis for further diagnostic tests or as a preliminary starting point for early preventive
therapy, exhibit a p value in preliminary screening association analyses of about 1 x 10^-2 or less. More preferably the p
value is about 1 x 10^-4 or less.

Similarly, without wishing to be limited to any particular value for diagnostics based on more than one biallelic
marker, it is preferred that the haplotype exhibit a p value of 1 x 10^-3 or less, still more preferably 1 x 10^-5 or less and
most preferably of about 1 x 10^-5 or less in a preliminary screening haplotype analysis. These values are believed to be
applicable to any association studies involving single or multiple marker combinations. Significance thresholds may be
refined according to the methods previously described.

Example 32 describes methods for determining the biallelic marker pattern in a nucleic acid sample.

Example 32

A nucleic acid sample is obtained from an individual to be tested for susceptibility to a detectable trait or for a 
detectable trait caused by a particular mutation. The nucleic acid sample may be an RNA sample or a DNA sample.

A PCR amplification is conducted using primer pairs which generate amplification products containing the
polymorphic nucleotides of one or more biallelic markers associated with such a predisposition or causative mutation.
For example, the amplification products may contain the polymorphic bases of one or more of the biallelic markers in the
maps of the present invention, including any of the 653 biallelic markers obtained above (which include the sequences of
SEQ ID Nos. 1-50 and 51-100 or the sequences complementary thereto), the asthma-associated biallelic markers, the
PG1 biallelic markers, and the Apo E biallelic markers or biallelic markers in linkage disequilibrium with any of these
biallelic markers. In some embodiments, the PCR amplification is conducted using primer pairs which generate
amplification products containing the polymorphic nucleotides of several biallelic markers. For example, in one
embodiment amplification products containing the polymorphic bases of one or more biallelic markers in the maps of the
present invention, including any of the 653 biallelic markers obtained above (which include the sequences of SEQ ID
Nos. 1-50 and 51-100 or the sequences complementary thereto), the asthma-associated biallelic markers, the PG1
biallelic markers, and the Apo E biallelic markers, biallelic markers which are in linkage disequilibrium therewith or with a
causative mutation associated with a detectable phenotype may be generated. In another embodiment, amplification
products containing the polymorphic bases of five or more biallelic markers in the maps of the present invention,
including any of the 653 biallelic markers obtained above (which include the sequences of SEQ ID Nos. 1-50 and
51-100 or the sequences complementary thereto), the asthma-associated biallelic markers, the PG1 biallelic markers, and
the Apo E biallelic markers, biallelic markers which are in linkage disequilibrium therewith or with a causative mutation
associated with a detectable phenotype may be generated. In another embodiment, amplification products containing the
polymorphic bases of 20 or more biallelic markers in the maps of the present invention, including any of the 653 biallelic
markers obtained above (which include the sequences of SEQ ID Nos. 1-50 and 51-100 or the sequences complementary
thereto), the asthma-associated biallelic markers, the PG1 biallelic markers, and the Apo E biallelic markers, biallelic
markers which are in linkage disequilibrium therewith or with the causative mutation may be generated. In another
embodiment, amplification products containing the polymorphic bases of 100 or more biallelic markers in the maps of the
present invention, including any of the 653 biallelic markers obtained above (which include the sequences of SEQ ID
Nos. 1-50 and 51-100 or the sequences complementary thereto), the asthma-associated biallelic markers, the PG1
biallelic markers, and the Apo E biallelic markers, biallelic markers which are in linkage disequilibrium therewith or with a
causative mutation associated with a detectable phenotype may be generated. In another embodiment, amplification
products containing the polymorphic bases of 200 or more biallelic markers in the maps of the present invention,
including any of the 653 biallelic markers obtained above (which include the sequences of SEQ ID Nos. 1-50 and
51-100 or the sequences complementary thereto), the asthma-associated biallelic markers, the PG1 biallelic markers, and
the Apo E biallelic markers, biallelic markers which are in linkage disequilibrium therewith or with a causative mutation
associated with a detectable phenotype may be generated. In another embodiment, amplification products containing the
polymorphic bases of 300 or more biallelic markers in the maps of the present invention, including any of the 653
biallelic markers obtained above (which include the sequences of SEQ ID Nos. 1-50 and 51-100 or the sequences
complementary thereto), the asthma-associated biallelic markers, the PG1 biallelic markers, and the Apo E biallelic
markers, biallelic markers which are in linkage disequilibrium therewith or with the causative mutation may be
generated. In another embodiment amplification products containing the polymorphic bases of 400 or more biallelic
5 markers in the maps of the present invention, including any of the the 653 biallelic markers obtained above {which 

include the sequences of SEQ ID Nos. 1-50 and SHOO or the sequences complementary thereto), the asthma-associated 
biallelic markers, the PG1 biallelic markers, and the Ap« £ biallalic markup biallciic markers which tire in linkage 
disequilibrium therewith or with a causative mutation associated with a detectable phenotypc may be generated. 

The primers used to generate the amplification products may be designed as described herein. Representative
amplification primers for generating amplification products containing the polymorphic bases of the biallelic markers of
SEQ ID Nos. 1-50 and 51-100 are provided as SEQ ID Nos. 101-150/151-200 in the accompanying Sequence Listing.
The PCR primers may be oligonucleotides of 10, 15, 20 or more bases in length which enable the amplification of the
polymorphic site in the markers. In some embodiments, the amplification product produced using these primers may be
at least 100 bases in length (i.e. about 50 nucleotides on each side of the polymorphic base). In other embodiments, the
amplification product produced using these primers may be at least 500 bases in length (i.e. about 250 nucleotides on
each side of the polymorphic base). In still further embodiments, the amplification product produced using these primers
may be at least 1000 bases in length (i.e. about 500 nucleotides on each side of the polymorphic base).
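
The relationship between amplicon length and the flanking sequence on each side of the polymorphic base can be illustrated with a minimal sketch. The function name `amplicon_window`, the 0-based coordinate convention, and the example sequence length are assumptions made for illustration only and are not part of the disclosed methods.

```python
def amplicon_window(seq_len, snp_index, flank):
    """Return (start, end) of a window centred on the polymorphic base.

    Coordinates are 0-based and `end` is exclusive, so the window holds
    `flank` bases upstream, the polymorphic base itself, and `flank`
    bases downstream (2 * flank + 1 bases in total).
    """
    start = snp_index - flank
    end = snp_index + flank + 1
    if start < 0 or end > seq_len:
        raise ValueError("not enough flanking sequence around the polymorphic base")
    return start, end


if __name__ == "__main__":
    # Hypothetical 2000-base reference with the polymorphic base at index 1000.
    for flank in (50, 250, 500):
        start, end = amplicon_window(seq_len=2000, snp_index=1000, flank=flank)
        print(f"flank={flank}: amplicon spans [{start}, {end}), length {end - start}")
```

With flanks of about 50, 250 and 500 nucleotides the window comprises 101, 501 and 1001 bases respectively, consistent with the "at least 100", "at least 500" and "at least 1000" base lengths given above. Actual primer placement would additionally depend on primer design constraints not modelled here.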

Table 9 lists the internal identification numbers of the 50 localized markers described herein and the Apo E
markers described herein, the SEQ ID Nos. for each of the two alleles of these biallelic markers, the SEQ ID Nos. of
representative upstream and downstream amplification primers which can be used to generate amplification products
including the polymorphic bases of these biallelic markers, and the SEQ ID Nos. of microsequencing primers which can be
used to determine the identities of the polymorphic bases of these markers.

Table 10

Marker           SEQ ID Nos.          SEQ ID Nos.                SEQ ID Nos.
(Genset code)    First     Second     Amplification primers      Microsequencing primers
                 allele    allele     Upstream     Downstream    1          2

99-2647          49        99         149          199           249        299
39-2649          50        100        150          200           250        300

It will be appreciated that the primers listed in Table 9 are merely exemplary and that any other set of primers
which produce amplification products containing the polymorphic nucleotides of one or more of the biallelic markers of
SEQ ID Nos. 1-50 and 51-100 or biallelic markers in linkage disequilibrium therewith or with a causative mutation for a
detectable trait, or a combination thereof, may be used in the diagnostic methods. It will also be appreciated that these
diagnostic methods may be performed with any biallelic marker or combination of biallelic markers included in the maps
of the present invention.

Following the PCR amplification, the identities of the polymorphic bases of one or more of the biallelic markers
in the nucleic acid sample are determined. The identities of the polymorphic bases may be determined using the
microsequencing procedures described in Example 13. It will be appreciated that the microsequencing primers listed as
SEQ ID Nos. 201-250 and 251-300 are merely exemplary and that any primer having a 3' end near the polymorphic
nucleotide, and preferably immediately adjacent to the polymorphic nucleotide, may be used. Similarly, it will be
appreciated that microsequencing analysis may be performed for any marker or combination of markers in the maps of
the present invention.
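
By way of illustration only, a primer whose 3' end lies immediately adjacent to the polymorphic nucleotide can be derived from the marker sequence as sketched below. The function names, the default primer length of 20 bases, and the example sequence are hypothetical; the sketch ignores melting temperature, secondary structure, and other design considerations.

```python
# Complement table for unambiguous DNA bases.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence written 5'->3'."""
    return seq.translate(_COMPLEMENT)[::-1]


def microsequencing_primers(seq, snp_index, length=20):
    """Return (forward, reverse) primers whose 3' ends abut the polymorphic base.

    `seq` is the marker sequence written 5'->3' and `snp_index` is the
    0-based position of the polymorphic base. The forward primer is the
    `length` bases immediately upstream; the reverse primer is the reverse
    complement of the `length` bases immediately downstream.
    """
    if snp_index < length or snp_index + length >= len(seq):
        raise ValueError("not enough flanking sequence for the requested primer length")
    forward = seq[snp_index - length:snp_index]
    reverse = reverse_complement(seq[snp_index + 1:snp_index + 1 + length])
    return forward, reverse


if __name__ == "__main__":
    # Hypothetical marker fragment; the polymorphic base is at index 10.
    marker = "AACCGGTTAC" + "A" + "GGATCCTTAA"
    print(microsequencing_primers(marker, snp_index=10, length=5))
```

Either primer may be extended by a single labelled terminator across the polymorphic site; the reverse primer reports the complementary base.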

Alternatively, the microsequencing analysis may be performed as described in Pastinen et al., Genome
Research 7:606-614 (1997), the disclosure of which is incorporated herein by reference, and which is described in more
detail below.

Alternatively, the PCR product may be completely sequenced to determine the identities of the polymorphic
bases in the biallelic markers. In another method, the identities of the polymorphic bases in the biallelic markers are
determined by hybridizing the amplification products to microarrays containing allele specific oligonucleotides specific
for the polymorphic bases in the biallelic markers. The use of microarrays comprising allele specific oligonucleotides is
described in more detail below.

It will be appreciated that the identities of the polymorphic bases in the biallelic markers may be determined
using techniques other than those listed above, such as conventional dot blot analyses.

Nucleic acids used in the above diagnostic procedures may comprise at least 10 consecutive nucleotides,
including the polymorphic bases, of the biallelic markers in the maps of the present invention, including any of the 653
biallelic markers obtained above (which include the sequences of SEQ ID Nos. 1-50 and 51-100 or the sequences
complementary thereto), the asthma-associated biallelic markers, the PG1 biallelic markers, and the new Apo E biallelic
markers, including those of SEQ ID Nos. 301-305/307-311 or the sequences complementary thereto. Alternatively, the
nucleic acids used in the above diagnostic procedures may comprise at least 15 consecutive nucleotides, including the
polymorphic bases, of the biallelic markers in the maps of the present invention, including any of the 653 biallelic
markers obtained above (which include the sequences of SEQ ID Nos. 1-50 and 51-100 or the sequences complementary
thereto), the asthma-associated biallelic markers, the PG1 biallelic markers, and the new Apo E biallelic markers,
including those of SEQ ID Nos. 301-305/307-311 or the sequences complementary thereto. In some embodiments, the
nucleic acids used in the above diagnostic procedures may comprise at least 20 consecutive nucleotides, including the
polymorphic bases, of the biallelic markers in the maps of the present invention, including any of the 653 biallelic
markers obtained above (which include the sequences of SEQ ID Nos. 1-50 and 51-100 or the sequences complementary
thereto), the asthma-associated biallelic markers, the PG1 biallelic markers, and the new Apo E biallelic markers,
including those of SEQ ID Nos. 301-305/307-311 or the sequences complementary thereto. In still other embodiments,
the nucleic acids used in the above diagnostic procedures may comprise at least 30 consecutive nucleotides, including
the polymorphic bases, of the biallelic markers in the maps of the present invention, including any of the 653 biallelic
markers obtained above (which include the sequences of SEQ ID Nos. 1-50 and 51-100 or the sequences complementary
thereto), the asthma-associated biallelic markers, the PG1 biallelic markers, and the new Apo E biallelic markers,
including those of SEQ ID Nos. 301-305/307-311 or the sequences complementary thereto. In further embodiments, the
nucleic acids used in the above diagnostic procedures may comprise more than 30 consecutive nucleotides, including the
polymorphic bases, of the biallelic markers in the maps of the present invention, including any of the 653 biallelic
markers obtained above (which include the sequences of SEQ ID Nos. 1-50 and 51-100 or the sequences complementary
thereto), the asthma-associated biallelic markers, the PG1 biallelic markers, and the new Apo E biallelic markers,
including those of SEQ ID Nos. 301-305/307-311 or the sequences complementary thereto. In still further embodiments,
the nucleic acids used in the above diagnostic procedures may comprise the entire sequence of the biallelic markers in
the maps of the present invention, including any of the 653 biallelic markers obtained above (which include the
sequences of SEQ ID Nos. 1-50 and 51-100 or the sequences complementary thereto), the asthma-associated biallelic
markers, the PG1 biallelic markers, and the new Apo E biallelic markers, including those of SEQ ID Nos. 301-305/307-311
or the sequences complementary thereto. In some embodiments the nucleic acids used in the diagnostic procedures
are longer than the sequences of SEQ ID Nos. 1-50, 51-100, 301-305 and 307-311 because they contain nucleotides
adjacent to these sequences.

The diagnostics of the present invention may also employ nucleic acid arrays attached to DNA chips or any
other suitable solid support, including beads. As used herein, the term array means a one dimensional, two dimensional, or
multidimensional arrangement of a plurality of nucleic acids of sufficient length to permit specific detection of nucleic acids
capable of hybridizing thereto.

DNA chips allow the integration of micro-biochemical processes (such as DNA hybridization), systems of signal
detection (such as fluorescence) and data processing into a single system which can be used to obtain information on
polymorphism. The solid surface of the chip is often made of silicon or glass but it can be a polymeric membrane.
Efficient access to polymorphism information is obtained through a basic structure comprising high-density arrays of
oligonucleotide probes attached to a solid support (the chip) at selected positions. The immobilization of arrays of DNA
probes on solid supports has been rendered possible by the development of a technology generally identified as "Very
Large Scale Immobilized Polymer Synthesis" (VLSIPS™) and in which, typically, probes are immobilized in a high density
array on a solid surface of a chip. Examples of VLSIPS™ technologies are provided in US Patents 5,143,854 and
5,412,087 and in PCT Publications WO 90/15070, WO 92/10092 and WO 95/11995, the disclosures of which are
incorporated herein by reference, which describe methods for forming oligonucleotide arrays through techniques such as
light-directed synthesis techniques.

In designing strategies aimed at providing arrays of nucleotides immobilized on solid supports, further
presentation strategies were developed to order and display the probe arrays on the chips in an attempt to maximize
hybridization patterns and sequence information. Examples of such presentation strategies are disclosed in PCT
Publications WO 94/12305, WO 94/11530, WO 97/29212 and WO 97/31200, the disclosures of which are incorporated
herein by reference.

Each DNA chip can contain thousands to millions of individual synthetic DNA probes arranged in a grid-like
pattern and miniaturized to the size of a dime.

The chip technology has been successfully used to detect mutations in numerous cases. For example, the
screening of mutations has been undertaken in the BRCA1 gene, in S. cerevisiae mutant strains, and in the protease
gene of HIV-1 virus (see Hacia et al., Nat. Genet. 14:441-447 (1996); Shoemaker et al., Nat. Genet. 14:450-456 (1996);
Kozal et al., Nat. Med. 2:753-759 (1996), the disclosures of which are incorporated herein by reference). At least three
companies propose chips able to detect biallelic polymorphisms: Affymetrix (GeneChip), Hyseq (HyChip and HyGnostics),
and Protogene Laboratories.

In some embodiments, the efficiency of hybridization of nucleic acids in the sample with the probes attached to
the chip may be improved by using polyacrylamide gel pads isolated from one another by hydrophobic regions in which
the DNA probes are covalently linked to an acrylamide matrix.

The polymorphic bases present in the biallelic marker or markers of the sample nucleic acids are determined as
follows. Probes which contain at least a portion of one or more of the biallelic markers of the present invention are
synthesized either in situ or by conventional synthesis and immobilized on an appropriate chip using methods known to
the skilled technician.

The nucleic acid sample which includes the candidate region to be analyzed is isolated, amplified with primers
capable of generating an amplification product containing the polymorphic bases of one or more biallelic markers, and
labeled with a reporter group. The reporter group can be a fluorescent group such as phycoerythrin. The labeled nucleic
acid is then incubated with the probes immobilized on the chip using a fluidics station. For example, Manz et al. (Adv. in
Chromatogr. 33:1-66 (1993), the disclosure of which is incorporated herein by reference) describe the fabrication of
fluidics devices and particularly microcapillary devices, in silicon and glass substrates.

After the reaction is completed, the chip is inserted into a scanner and patterns of hybridization are detected.
The hybridization data is collected as a signal emitted from the reporter groups already incorporated into the nucleic
acids generated in the amplification of the sample DNA, which is now bound to the probes attached to the chip. Probes
that perfectly match a sequence of the nucleic acid sample generally produce stronger signals than those that have
mismatches. Since the sequence and position of each probe immobilized on the chip is known, the identity of the nucleic
acid hybridized to a given probe can be determined.

For single-nucleotide polymorphism analyses, sets of four oligonucleotides are generally designed (one for each
possible base) that span each position of a portion of the candidate region found in the nucleic acid sample, differing only
in the identity of the central base. The relative intensity of hybridization to each series of probes at a particular location
allows the identification of the base corresponding to the central base of the probe. For example, to detect single
nucleotide polymorphisms such as those in the present biallelic markers, oligonucleotides having each of the two allelic
bases at their central position are affixed to the chip. The amplification products resulting from amplification of the
nucleic acids in the sample are hybridized to the chip under high stringency (at lower salt concentration and higher
temperature over shorter time periods) to facilitate specific detection of the polymorphic sequences present in the
nucleic acid sample.
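
The allele specific probe design and the intensity-based identification of the polymorphic base described above can be sketched as follows. This is an illustrative sketch only: the function names, the 12-base flank, and the intensity ratio used to distinguish homozygous from heterozygous samples are assumptions, not values taken from the present disclosure.

```python
def allele_probes(seq, snp_index, alleles, flank=12):
    """Build one probe per allele, identical except for the central base.

    `seq` is the marker sequence (5'->3'), `snp_index` the 0-based position
    of the polymorphic base, and `alleles` the two allelic bases.
    """
    if snp_index < flank or snp_index + flank >= len(seq):
        raise ValueError("not enough flanking sequence for the requested probe length")
    left = seq[snp_index - flank:snp_index]
    right = seq[snp_index + 1:snp_index + 1 + flank]
    return {allele: left + allele + right for allele in alleles}


def call_genotype(intensities, ratio=3.0):
    """Call a genotype from the hybridization intensities of two allelic probes.

    The sample is called homozygous for the stronger allele when its signal
    exceeds the weaker one by at least `ratio` (an assumed cut-off);
    otherwise it is called heterozygous.
    """
    (a1, s1), (a2, s2) = sorted(intensities.items(), key=lambda kv: kv[1], reverse=True)
    if s2 == 0 or s1 / s2 >= ratio:
        return a1 + a1
    return "".join(sorted((a1, a2)))


if __name__ == "__main__":
    marker = "ACGT" * 6 + "A"          # hypothetical 25-base marker fragment
    print(allele_probes(marker, snp_index=12, alleles="AG", flank=3))
    print(call_genotype({"A": 900.0, "G": 100.0}))
    print(call_genotype({"A": 500.0, "G": 450.0}))
```

In practice the cut-off would be calibrated against control samples of known genotype, since perfectly matched probes generally, but not invariably, give the stronger signal.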

The use of direct electric field control improves the determination of single base mutations (Nanogen). A
positive field increases the transport rate of negatively charged nucleic acids and results in a 10-fold increase of the
hybridization rates. Using this technique, single base pair mismatches are detected in less than 15 sec (see Sosnowski et
al., Proc. Natl. Acad. Sci. USA 94:1119-1123 (1997), the disclosure of which is incorporated herein by reference).

Another technique which can be used to analyze polymorphisms includes multicomponent integrated systems
which miniaturize and compartmentalize processes such as restriction enzyme digestion, PCR reactions, and capillary
electrophoresis in a single functional device. An example of such technique is disclosed in US patent 5,589,136, the
disclosure of which is incorporated herein by reference, which concerns the integration of PCR amplification and
capillary electrophoresis in chips. Integrated systems are best applied with microfluidic systems. These systems
comprise a pattern of microchannels designed onto a glass, silicon, quartz, or plastic wafer included on a microchip. The
movements of the samples are controlled by electric forces applied across different areas of the microchip to create
functional microscopic valves and pumps with no moving parts. Regulating or varying the voltage controls the liquid flow
at intersections between the micro-machined channels and changes the liquid flow rate for pumping across different
sections of the microchip.

In the case of biallelic marker analyses, the microchip integrates nucleic acid amplification, a microsequencing
reaction (such as the one described above), capillary electrophoresis and a detection method such as laser-induced
fluorescence detection.

In a first step, the DNA samples are amplified, preferably by PCR. Then, the amplification products are
subjected to automated microsequencing reactions using ddNTPs (specific fluorescence for each ddNTP) and the
appropriate oligonucleotide microsequencing primers which hybridize just upstream of the targeted polymorphic base.
The microsequencing reactions may employ primers capable of being extended to the polymorphic bases of the biallelic
markers. Preferably, the microsequencing primers comprise a sequence terminating at the base immediately preceding
the polymorphic base of the biallelic markers. Once the extension at the 3' end is completed, the primers are separated
from the unincorporated fluorescent ddNTPs by capillary electrophoresis. The separation medium used in capillary
electrophoresis can for example be polyacrylamide, polyethyleneglycol or dextran. The incorporated ddNTPs in the
single-nucleotide primer extension products are identified by fluorescence detection. Preferably, the microchip can be used to
process at least 96 samples in parallel. More preferably, the microchip can be used to process at least 384 samples in
parallel. Preferably, the microchip is designed for use with detection procedures using four color laser induced
fluorescence detection of the ddNTPs.
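
The single-nucleotide primer extension step described above can be modelled in a few lines. The sketch below is purely illustrative: the function names and example sequences are hypothetical, exact primer-template matching is assumed, and the chemistry (polymerase fidelity, dye detection) is not modelled.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence written 5'->3'."""
    return seq.translate(_COMPLEMENT)[::-1]


def extend_one_base(template, primer):
    """Return the ddNTP incorporated at the 3' end of an annealed primer.

    `template` is the strand the primer anneals to, written 5'->3'. The
    primer pairs with the reverse complement of its own sequence on the
    template, and the incorporated terminator is complementary to the
    template base lying immediately 5' of that site.
    """
    site = reverse_complement(primer)
    pos = template.find(site)
    if pos <= 0:
        raise ValueError("primer does not anneal, or no template base remains to copy")
    return template[pos - 1].translate(_COMPLEMENT)


def genotype(templates, primer):
    """Return the sorted list of distinct ddNTPs incorporated across all templates."""
    return sorted({extend_one_base(t, primer) for t in templates})


if __name__ == "__main__":
    # Hypothetical marker alleles (sense strand); polymorphic base at index 10.
    allele_a = "AACCGGTTAC" + "A" + "GGATCCTTAA"
    allele_g = "AACCGGTTAC" + "G" + "GGATCCTTAA"
    primer = "GTTAC"  # terminates at the base immediately preceding the polymorphic base
    templates = [reverse_complement(allele_a), reverse_complement(allele_g)]
    print(genotype(templates, primer))
```

A heterozygous sample yields two differently labelled extension products, whereas a homozygous sample yields one, which is the distinction that four color fluorescence detection is intended to resolve.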



Any one or more alleles of the biallelic markers in the maps of the present invention, or fragments thereof
containing the polymorphic bases, may be fixed to a solid support, such as a microchip or other immobilizing surface. The
fragments of these nucleic acids may comprise at least 10, at least 15, at least 20, at least 25, or more than 25
consecutive nucleotides of the biallelic markers described herein. Preferably, the fragments include the polymorphic bases of
the biallelic markers.

A nucleic acid sample is applied to the immobilizing surface and analyzed to determine the identities of the
polymorphic bases of one or more of the biallelic markers. In some embodiments, the solid support may also include one or
more of the amplification primers described herein, or fragments comprising at least 10, at least 15, or at least 20
consecutive nucleotides thereof, for generating an amplification product containing the polymorphic bases of the biallelic
markers to be analyzed in the sample.

Another embodiment of the present invention is a solid support which includes one or more of the microsequencing
primers listed in the accompanying Sequence Listing, or fragments comprising at least 10, at least 15, or at least 20
consecutive nucleotides thereof and having a 3' terminus immediately upstream of the polymorphic base of the
corresponding biallelic marker, for determining the identity of the polymorphic base of the one or more biallelic markers fixed
to the solid support.

For example, one embodiment of the present invention is an array of nucleic acids fixed to a solid support, such as
a microchip, bead, or other immobilizing surface, comprising one or more of the biallelic markers in the maps of the present
invention or a fragment comprising at least 10, at least 15, at least 20, at least 25, or more than 25 consecutive nucleotides
thereof including the polymorphic base. For example, the array may comprise one or more of any of the 653 biallelic
markers obtained above (which include the sequences of SEQ ID Nos. 1-50 and 51-100), the asthma-associated biallelic
markers, the PG1 biallelic markers, and the new Apo E biallelic markers (including SEQ ID Nos. 301-305/307-311) or the
sequences complementary thereto, or a fragment comprising at least 10, at least 15, at least 20, at least 25, or more than
25 consecutive nucleotides thereof including the polymorphic base. In a further embodiment, the array comprises at least
five of the biallelic markers in the maps of the present invention or a fragment comprising at least 10, at least 15, at least
20, at least 25, or more than 25 consecutive nucleotides thereof including the polymorphic base. For example, the arrays
may comprise at least five of any of the 653 biallelic markers obtained above (which include the sequences of SEQ ID
Nos. 1-50 and 51-100), the asthma-associated biallelic markers, the PG1 biallelic markers, and the new Apo E biallelic
markers (including the sequences of SEQ ID Nos. 301-305/307-311) or the sequences complementary thereto, or a
fragment comprising at least 10, at least 15, at least 20, at least 25, or more than 25 consecutive nucleotides thereof
including the polymorphic base. In a further embodiment the array comprises at least 10 of the biallelic markers in the
maps of the present invention or a fragment comprising at least 10, at least 15, at least 20, at least 25, or more than 25
consecutive nucleotides thereof including the polymorphic base. For example, the array may comprise at least 10 of any of
the 653 biallelic markers obtained above (which include the sequences of SEQ ID Nos. 1-50 and 51-100), the
asthma-associated biallelic markers, the PG1 biallelic markers, and the new Apo E biallelic markers (including the sequences of
SEQ ID Nos. 301-305/307-311) or the sequences complementary thereto, or a fragment comprising at least 10, at least 15,
at least 20, at least 25, or more than 25 consecutive nucleotides thereof including the polymorphic base. In a further
embodiment the array comprises at least 20 of the biallelic markers in the maps of the present invention or a fragment
comprising at least 15 consecutive nucleotides thereof including the polymorphic base. For example, the array may comprise
at least 20 of any of the 653 biallelic markers obtained above (which include the sequences of SEQ ID Nos. 1-50 and
51-100), the asthma-associated biallelic markers, the PG1 biallelic markers, and the new Apo E biallelic markers (including
the sequences of SEQ ID Nos. 301-305/307-311) or the sequences complementary thereto, or a fragment comprising at
least 10, at least 15, at least 20, at least 25, or more than 20 consecutive nucleotides thereof including the polymorphic
base. In a further embodiment the array comprises at least 100 of the biallelic markers in the maps of the present
invention or a fragment comprising at least 10, at least 15, at least 20, at least 25, or more than 25 consecutive nucleotides
thereof including the polymorphic base. For example, the array may comprise at least 100 of any of the 653 biallelic
markers obtained above (which include the sequences of SEQ ID Nos. 1-50 and 51-100), the asthma-associated biallelic
markers, the PG1 biallelic markers, and the new Apo E biallelic markers (including the sequences of SEQ ID Nos.
301-305/307-311) or the sequences complementary thereto, or a fragment comprising at least 10, at least 15, at least 20, at
least 25, or more than 25 consecutive nucleotides thereof including the polymorphic base. In a further embodiment the
array comprises at least 200 of the biallelic markers in the maps of the present invention or a fragment comprising
at least 10, at least 15, at least 20, at least 25, or more than 25 consecutive nucleotides thereof including the polymorphic
base. For example, the array may comprise at least 200 of any of the 653 biallelic markers obtained above (which include
the sequences of SEQ ID Nos. 1-50 and 51-100), the asthma-associated biallelic markers, the PG1 biallelic markers, and
the new Apo E biallelic markers (including the sequences of SEQ ID Nos. 301-305/307-311) or the sequences
complementary thereto, or a fragment comprising at least 10, at least 15, at least 20, at least 25, or more than 25
consecutive nucleotides thereof including the polymorphic base. In a further embodiment the array comprises at least 300
of the biallelic markers in the maps of the present invention or a fragment comprising at least 10, at least 15, at least 20, at
least 25, or more than 25 consecutive nucleotides thereof including the polymorphic base. For example, the array may
comprise at least 300 of any of the 653 biallelic markers obtained above (which include the sequences of SEQ ID Nos.
1-50 and 51-100), the asthma-associated biallelic markers, the PG1 biallelic markers, and the new Apo E biallelic markers
(including the sequences of SEQ ID Nos. 301-305/307-311) or the sequences complementary thereto, or a fragment
comprising at least 10, at least 15, at least 20, at least 25, or more than 25 consecutive nucleotides thereof including the
polymorphic base. In a further embodiment the array comprises at least 400 of the biallelic markers in the maps of the
present invention or a fragment comprising at least 10, at least 15, at least 20, at least 25, or more than 25 consecutive
nucleotides thereof including the polymorphic base. For example, the array may comprise at least 400 of any of the 653
biallelic markers obtained above (which include the sequences of SEQ ID Nos. 1-50 and 51-100), the asthma-associated
biallelic markers, the PG1 biallelic markers, and the new Apo E biallelic markers (including the sequences of SEQ ID Nos.
301-305/307-311) or the sequences complementary thereto, or a fragment comprising at least 10, at least 15, at least 20,
at least 25, or more than 25 consecutive nucleotides thereof including the polymorphic base. In a further embodiment the
array comprises more than 400 of the biallelic markers in the maps of the present invention or a fragment comprising at
least 10, at least 15, at least 20, at least 25, or more than 25 consecutive nucleotides thereof including the polymorphic
base. For example, the array may comprise at least 400 of any of the 653 biallelic markers obtained above (which include
the sequences of SEQ ID Nos. 1-50 and 51-100), the asthma-associated biallelic markers, the PG1 biallelic markers, and
the new Apo E biallelic markers (including the sequences of SEQ ID Nos. 301-305/307-311) or the sequences
complementary thereto, or a fragment comprising at least 10, at least 15, at least 20, at least 25, or more than 25
consecutive nucleotides thereof including the polymorphic base. Each of the embodiments listed above may also include one
or more of the sequences of SEQ ID Nos. 306 and 312 in addition to those enumerated above.

Another embodiment of the present invention is an array comprising amplification primers for generating
amplification products containing the polymorphic bases of one or more, at least five, at least 10, at least 20, at least 100,
at least 200, at least 300, at least 400, or more than 400 of the biallelic markers in the maps of the present invention. For
example, the array may comprise amplification primers for generating amplification products containing the polymorphic
bases of one or more, at least five, at least 10, at least 20, at least 100, at least 200, at least 300, at least 400, or more
than 400 of any of the 653 biallelic markers obtained above (which include the sequences of SEQ ID Nos. 1-50 and
51-100 or the sequences complementary thereto), the asthma-associated biallelic markers, the PG1 biallelic markers, and
the new Apo E biallelic markers (including the sequences of SEQ ID Nos. 301-305/307-311 or the sequences
complementary thereto). In such arrays, the amplification primers included in the array are capable of amplifying the
biallelic marker sequences to be detected in the nucleic acid sample applied to the array (i.e. the amplification primers
correspond to the biallelic markers affixed to the array). For example, if the array is designed to detect the biallelic marker of
SEQ ID Nos. 1 and 51 it may also contain SEQ ID Nos. 101 and 151, the amplification primers capable of generating an
amplicon which includes SEQ ID Nos. 1 and 51. Thus, the arrays may include one or more of the amplification primers
of SEQ ID Nos. 101-200, 313-317, and 319-323 corresponding to the one or more biallelic markers of SEQ ID Nos. 1-50,
51-100, 301-305, and 307-311 which are included in the array. In other embodiments, the arrays may include
amplification primers capable of generating an amplification product which includes the biallelic markers SEQ ID Nos.
306 and 312 in addition to amplification primers capable of generating an amplification product containing each of the
markers enumerated above. Thus, in such embodiments, the arrays may further include the amplification primers of SEQ
ID Nos. 318 and 324.

Another embodiment of the present invention is an array which includes microsequencing primers capable of
determining the identity of the polymorphic bases of one or more, at least five, at least 10, at least 20, at least 100, at least
200, at least 300, at least 400, or more than 400 of the biallelic markers in the maps of the present invention. For
example, the array may comprise microsequencing primers capable of determining the identity of the polymorphic bases of
one or more, at least five, at least 10, at least 20, at least 100, at least 200, at least 300, at least 400, or more than 400
of the 653 biallelic markers obtained above (which include the sequences of SEQ ID Nos. 1-50 and 51-100 or the
sequences complementary thereto), the asthma-associated biallelic markers, the PG1 biallelic markers, and the new Apo
E biallelic markers (including the sequences of SEQ ID Nos. 301-305/307-311 or the sequences complementary thereto).
The sequences of representative microsequencing primers which may be included in the array are listed in the sequence
listing as SEQ ID Nos. 201-300, 325-329, and 331-335. In other embodiments, the arrays may further include
microsequencing primers for determining the identity of the polymorphic bases of one or more of the sequences of SEQ
ID Nos. 306 and 312, such as the microsequencing primers of SEQ ID Nos. 330 and 336.

Arrays containing any combination of the above nucleic acids which permits the specific detection or
identification of the polymorphic bases of the biallelic markers in the maps of the present invention, including any
combination of the 653 biallelic markers obtained above (which include the sequences of SEQ ID Nos. 1-50 and 51-100
or the sequences complementary thereto), the asthma-associated biallelic markers, the PG1 biallelic markers, and the
new Apo E biallelic markers (including the sequences of SEQ ID Nos. 301-305/307-311 or the sequences complementary
thereto) are also within the scope of the present invention. Other embodiments of the arrays include nucleic acids which
permit the specific detection or identification of the polymorphic bases of one or more of SEQ ID Nos. 306 and 312 in
addition to the nucleic acids permitting the specific detection or identification of the polymorphic bases of the biallelic
markers listed in the preceding sentence. For example, the array may comprise both the biallelic markers and
amplification primers capable of generating amplification products containing the polymorphic bases of the biallelic
markers. Alternatively, the array may comprise both amplification primers capable of generating amplification products
containing the polymorphic bases of the biallelic markers and microsequencing primers capable of determining the
identities of the polymorphic bases of these markers.

Although the above examples describe arrays comprising specific groups of biallelic markers and, in some
embodiments, specific amplification primers and microsequencing primers, it will be appreciated that the present
invention encompasses arrays including any biallelic marker, group of biallelic markers, amplification primer, group of
amplification primers, microsequencing primer, or group of microsequencing primers described herein, as well as any
combination of the preceding nucleic acids.

Alternatively, the microsequencing procedures described above may be used to determine whether an individual
possesses a pattern of biallelic marker alleles associated with a detectable trait. In this approach, a PCR reaction is
performed on the DNA or RNA of the individual to be tested to amplify the desired biallelic markers or portions thereof. The
amplification product is hybridized to one or more oligonucleotides which are fixed to a surface and which have their 3' end
one base from the position of the polymorphic bases of the biallelic markers. The oligonucleotides are extended one base
using a detectably labeled dNTP and a polymerase. Incorporation of a pattern of detectably labeled bases indicative of a
biallelic marker pattern associated with a detectable trait indicates that the individual suffers from a detectable trait as the
result of a particular mutation or that the individual is at risk for developing the detectable trait at a subsequent time.
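
The one-base extension step described above can be illustrated with a small in silico model. The following is only a toy sketch and not the laboratory procedure: the template sequences, the primer, and the function names are hypothetical, and the model assumes that the primer anneals perfectly and at a single site, so that its 3' end lies one base from the polymorphic position.

```python
# Toy model of one-base primer extension (microsequencing).
# All sequences and names below are hypothetical illustrations.

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def extended_base(template: str, primer: str) -> str:
    """Return the labeled base a polymerase would add to the primer.

    `template` is the amplified strand written 5'->3'. The primer anneals
    to it, so the reverse complement of the primer must occur in the
    template. The template base immediately 5' of that annealing site is
    the polymorphic base, and the incorporated nucleotide is its complement.
    """
    site = reverse_complement(primer)
    start = template.find(site)
    if start <= 0:
        raise ValueError("primer does not anneal next to a template base")
    return COMPLEMENT[template[start - 1]]


# Two hypothetical alleles differing only at the polymorphic base (A or G).
PRIMER = "GCTAGGATCC"
ALLELE_1 = "ACGTTGCA" + "A" + "GGATCCTAGC"
ALLELE_2 = "ACGTTGCA" + "G" + "GGATCCTAGC"

calls = {extended_base(ALLELE_1, PRIMER), extended_base(ALLELE_2, PRIMER)}
```

In this model a sample carrying both alleles would yield two different labeled bases at the same oligonucleotide, which corresponds to a heterozygous call for that marker.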

In addition to their use in diagnostic techniques such as those described above, any of the arrays described above
may also be used to identify a haplotype (i.e. a set of alleles of biallelic markers) which is associated with a particular trait.
As described above, in such analyses nucleic acid samples are obtained from trait positive and trait negative individuals and
the alleles of biallelic markers present in each population are determined to identify a haplotype which is statistically
associated with the trait. The arrays may be employed in haplotype analyses as follows. Nucleic acid samples obtained
from trait positive and trait negative individuals are amplified with primers capable of generating amplification products
which include the polymorphic bases of the biallelic markers. The amplification products are labeled with a reporter group
and allowed to contact the biallelic marker probes which are attached to the support. As described above, the biallelic
marker probes to which the labeled amplification products specifically hybridize are determined to indicate which alleles of
the biallelic markers are present in the samples. The patterns of alleles of biallelic markers in the trait positive and trait
negative individuals are then determined to identify a haplotype having a statistically significant association with the trait.

Alternatively, as described above, the nucleic acid samples from trait positive and trait negative individuals may be
applied to an array comprising amplification primers capable of generating amplification products which include the
polymorphic bases of the biallelic markers. The identities of the polymorphic bases in the amplification products are then
determined using techniques such as the microsequencing procedures disclosed herein. Alternatively, amplification can be
conducted in liquid phase and microsequencing may be conducted on the array.

Alternatively, both amplification and microsequencing reactions may be performed in liquid phase. In such
embodiments, the labeled nucleotides incorporated in the microsequencing primers during the microsequencing reactions are
detected by hybridizing the extended microsequencing primers to sequences complementary to the microsequencing primers.
The sequences complementary to the microsequencing primers are immobilized on a support, such as those described above.
The amplification and microsequencing reactions performed in liquid phase may be multiplexed, allowing the samples to be
tested simultaneously for tens, hundreds, thousands or more biallelic markers.
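
The readout of a multiplexed liquid-phase reaction on a support carrying complementary sequences can be sketched as a lookup problem. The following is an illustrative model only: the array addresses, sequences, and function names are hypothetical, and hybridization is idealized as exact reverse-complement matching.

```python
# Idealized readout of extended microsequencing primers on a capture array.
# Addresses, sequences, and names are hypothetical.

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return "".join(COMPLEMENT[b] for b in reversed(seq))


def read_array(extended_primers, capture_probes):
    """Assign each extended primer to the array address that captures it.

    extended_primers: list of (primer_sequence, labeled_base) pairs
        produced in the liquid-phase reaction.
    capture_probes: dict mapping array address -> immobilized sequence,
        each complementary to one microsequencing primer.
    Returns a dict mapping address -> set of labeled bases detected there.
    """
    # Invert the probe table so each primer sequence points at its address.
    address_of = {reverse_complement(probe): addr
                  for addr, probe in capture_probes.items()}
    signals = {addr: set() for addr in capture_probes}
    for primer, base in extended_primers:
        addr = address_of.get(primer)
        if addr is not None:  # primers without a capture probe give no signal
            signals[addr].add(base)
    return signals


probes = {
    "A1": reverse_complement("GCTAGGATCC"),
    "A2": reverse_complement("TTGACCGTAA"),
}
reaction = [("GCTAGGATCC", "T"), ("GCTAGGATCC", "C"), ("TTGACCGTAA", "G")]
signals = read_array(reaction, probes)
```

Because each microsequencing primer is captured at its own address, the labeled bases for many markers can be read from a single pooled reaction.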

Preferably, the array used in the haplotype analysis comprises one or more groups of biallelic markers known to be
located in proximity to one another in the genome. For example, the biallelic markers in the groups may be derived from a
single YAC insert, a single BAC insert or a BAC subclone. Alternatively, the biallelic markers in the groups may be derived
from adjacent ordered clones. The biallelic markers in the groups may be located within a genomic region spanning less than
1 kb, from 1 to 5 kb, from 5 to 10 kb, from 10 to 25 kb, from 25 to 50 kb, from 50 to 150 kb, from 150 to 250 kb, from 250 to
500 kb, from 500 kb to 1 Mb, or more than 1 Mb. In some embodiments, the biallelic markers in the groups comprise biallelic
markers which have been localized to the same chromosome, subchromosomal region, or gene.

It will be appreciated that the ordered DNA containing the biallelic markers need not completely cover the genomic
regions of these lengths but may instead be incomplete contigs having one or more gaps therein.

In some embodiments, the biallelic markers known to be located in proximity to one another in the genome may be
located in physical proximity on the array. For example, the array may comprise one or more groups of at least 3 biallelic
markers known to be located in proximity to one another in the genome. In some embodiments, the array may comprise one
or more groups of at least 6 biallelic markers known to be located in proximity to one another in the genome. In other
embodiments, the array may comprise one or more groups of at least 20 biallelic markers known to be located in proximity
to one another in the genome.

The array may comprise one or more groups of biallelic markers known to be located in the same subchromosomal
region. For example, the array could comprise two or more biallelic markers located at 21q11.2 (selected from the group
consisting of SEQ ID Nos. 29, 79, 30 and 80), two or more markers located at 21q21 (selected from the group consisting of
SEQ ID Nos. 1, 51, 2, 52, 3 and 53), two or more markers located at 21q21.2 (selected from the group consisting of SEQ ID
Nos. 17, 67, 18, 68, 19, 69, 20, 70, 21, and 71), two or more markers located at 21q21.3-q22.13 (selected from the group
consisting of SEQ ID Nos. 25, 75, 26, 76, 27, 77, 28, 78, 31, 81, 32, 82, 38, 88, 39, 89, 40, 90, 48, 98, 49, 99, 50, 100,
22, 72, 23, 73, 24, 74, 4, 54, 5, 55, 6, 56, 7, 57, 8, 58, 9, 59, 10, 60, 11, 61, 12, 62, 13, 63, 14, 64, 15, 65, 16, and 66),
two or more markers located at 21q22.2 (selected from the group consisting of SEQ ID Nos. 41, 91, 42, 92, 43, 93, 44,
94, 45, 95, 46, 96, 47, and 97), and two or more markers located at 21q22.3 (selected from the group consisting of SEQ
ID Nos. 33, 83, 34, 84, 35, 85, 36, 86, 37, and 87). Alternatively, the array could comprise amplification primers capable of
generating an amplification product containing the polymorphic bases of two or more biallelic markers located at 21q11.2
(for example, amplification primers capable of generating an amplification product containing the polymorphic bases of two or
more biallelic markers selected from the group consisting of SEQ ID Nos. 29, 79, 30 and 80), two or more markers located
at 21q21 (for example, amplification primers capable of generating an amplification product containing the polymorphic
bases of two or more biallelic markers selected from the group consisting of SEQ ID Nos. 1, 51, 2, 52, 3 and 53), two or
more markers located at 21q21.2 (for example, amplification primers capable of generating an amplification product
containing the polymorphic bases of two or more biallelic markers selected from the group consisting of SEQ ID Nos. 17, 67,
18, 68, 19, 69, 20, 70, 21, and 71), two or more markers located at 21q21.3-q22.13 (for example, amplification primers
capable of generating an amplification product containing the polymorphic bases of two or more biallelic markers selected
from the group consisting of SEQ ID Nos. 25, 75, 26, 76, 27, 77, 28, 78, 31, 81, 32, 82, 38, 88, 39, 89, 40, 90, 48, 98, 49,
99, 50, 100, 22, 72, 23, 73, 24, 74, 4, 54, 5, 55, 6, 56, 7, 57, 8, 58, 9, 59, 10, 60, 11, 61, 12, 62, 13, 63, 14, 64, 15,
65, 16, and 66), two or more markers located at 21q22.2 (for example, amplification primers capable of generating an
amplification product containing the polymorphic bases of two or more biallelic markers selected from the group consisting
of SEQ ID Nos. 41, 91, 42, 92, 43, 93, 44, 94, 45, 95, 46, 96, 47, and 97), and two or more markers located at 21q22.3
(for example, amplification primers capable of generating an amplification product containing the polymorphic bases of two
or more biallelic markers selected from the group consisting of SEQ ID Nos. 33, 83, 34, 84, 35, 85, 36, 86, 37, and 87).

In some embodiments, the array may comprise one or more groups of biallelic markers derived from the same BAC
insert. For example, the array could comprise two or more markers selected from the group consisting of SEQ ID Nos. 29,
79, 30, and 80 (derived from BAC 1), two or more markers selected from the group consisting of SEQ ID Nos. 1 and 51
(derived from BAC 2), two or more markers selected from the group consisting of SEQ ID Nos. 2, 52, 3, and 53 (derived
from BAC 3), two or more markers selected from the group consisting of SEQ ID Nos. 17, 67, 18, 68, 19, 69, 20, 70, 21,
and 71 (derived from BAC 4), two or more markers selected from the group consisting of SEQ ID Nos. 25, 75, 26, 76, 27,
and 77 (derived from BAC 5), two or more markers selected from the group consisting of SEQ ID Nos. 28, 78, 31, 81, 32, and
82 (derived from BAC 6), two or more markers selected from the group consisting of SEQ ID Nos. 38, 88, 39, 89, 40, and
90 (derived from BAC 7), two or more markers selected from the group consisting of SEQ ID Nos. 48, 98, 49, 99, 50, and
100 (derived from BAC 8), two or more markers selected from the group consisting of SEQ ID Nos. 22, 72, 23, 73, 24, and
74 (derived from BAC 9), two or more markers selected from the group consisting of SEQ ID Nos. 4, 54, 5, 55, 6, 56, 7, 57,
8, 58, 9, 59, 10, and 60 (derived from BAC 10), two or more markers selected from the group consisting of SEQ ID Nos.
11, 61, 12, 62, 13, 63, 14, 64, 15, 65, 16, and 66 (derived from BAC 11), two or more markers selected from the group
consisting of SEQ ID Nos. 41, 91, 42, 92, 43, 93, 44, 94, 45, 95, 46, 96, 47, and 97 (derived from BAC 12), or two or more
markers selected from the group consisting of SEQ ID Nos. 33, 83, 34, 84, 35, 85, 36, 86, 37, and 87 (derived from BAC
13).
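
The BAC groupings enumerated above lend themselves to a simple lookup structure when selecting markers for an array. The sketch below is one possible representation, not part of the disclosed method. It reads the lists so that each SEQ ID No. n from 1 to 50 is accompanied by n + 50, which is an assumption drawn from the way the IDs are paired in the text; the function names are hypothetical.

```python
# SEQ ID Nos. of the biallelic markers listed for each BAC insert.
# Assumption: each SEQ ID No. n (1-50) is paired with n + 50.
BAC_MARKERS = {
    1: [29, 79, 30, 80],
    2: [1, 51],
    3: [2, 52, 3, 53],
    4: [17, 67, 18, 68, 19, 69, 20, 70, 21, 71],
    5: [25, 75, 26, 76, 27, 77],
    6: [28, 78, 31, 81, 32, 82],
    7: [38, 88, 39, 89, 40, 90],
    8: [48, 98, 49, 99, 50, 100],
    9: [22, 72, 23, 73, 24, 74],
    10: [4, 54, 5, 55, 6, 56, 7, 57, 8, 58, 9, 59, 10, 60],
    11: [11, 61, 12, 62, 13, 63, 14, 64, 15, 65, 16, 66],
    12: [41, 91, 42, 92, 43, 93, 44, 94, 45, 95, 46, 96, 47, 97],
    13: [33, 83, 34, 84, 35, 85, 36, 86, 37, 87],
}


def bac_of(seq_id):
    """Return the BAC number whose group lists the given SEQ ID No."""
    for bac, ids in BAC_MARKERS.items():
        if seq_id in ids:
            return bac
    raise KeyError(seq_id)


def same_bac(seq_ids):
    """Return True if all the given SEQ ID Nos. fall in one BAC group."""
    return len({bac_of(s) for s in seq_ids}) == 1


# Sanity check: the thirteen groups together cover SEQ ID Nos. 1-100 once each.
all_ids = sorted(i for ids in BAC_MARKERS.values() for i in ids)
assert all_ids == list(range(1, 101))
```

A helper such as `same_bac` could be used to confirm that a proposed group of two or more markers is drawn from a single BAC insert before the group is placed on an array.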

Arrays comprising biallelic markers known to be located in proximity to one another in the genome permit
haplotyping analyses to be conducted even when the chromosomal locations of the biallelic markers have not been
determined. For example, using the procedures described above, the alleles of sets of biallelic markers which are present in
nucleic acid samples from trait positive and trait negative individuals may be determined using a succession of arrays, with
each array having one or more groups of nucleic acids known to be located in proximity to one another thereon. The
succession of arrays may comprise biallelic markers spanning the entire genome having any of the average intermarker
distances specified above. Alternatively, the succession of arrays need not span the entire genome but may instead be
derived from two or more contigated YAC, BAC, or BAC subclone inserts. A statistical analysis is performed on the alleles
of biallelic markers present in the trait positive and trait negative individuals to identify a haplotype having a statistically
significant association with the trait. Once a statistically significant haplotype is identified, the genomic locations of the
biallelic markers comprising the haplotype may be determined using the methods described herein. In addition, using the
procedures described herein, the genomic region harboring the biallelic markers in the statistically significant haplotype may
be evaluated to identify the genes associated with the trait.
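
One common way to assess whether a haplotype has a statistically significant association with a trait is a chi-square test on a 2x2 table of haplotype counts in trait positive and trait negative individuals. The sketch below illustrates that general approach under stated assumptions and is not necessarily the statistical analysis used elsewhere in this disclosure: the counts are hypothetical, the function name is hypothetical, and an uncorrected Pearson chi-square with one degree of freedom is assumed.

```python
import math


def haplotype_chi_square(pos_with, pos_without, neg_with, neg_without):
    """Pearson chi-square (1 d.f., no continuity correction) for a 2x2 table.

    Rows: trait positive / trait negative chromosomes.
    Columns: carrying / not carrying the candidate haplotype.
    Returns (chi_square, p_value).
    """
    a, b, c, d = pos_with, pos_without, neg_with, neg_without
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        raise ValueError("a row or column total is zero")
    chi2 = n * (a * d - b * c) ** 2 / denom
    # For one degree of freedom, P(X > chi2) = erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


# Hypothetical counts: 60 of 200 trait positive chromosomes carry the
# haplotype, versus 30 of 200 trait negative chromosomes.
chi2, p = haplotype_chi_square(60, 140, 30, 170)
```

When many candidate haplotypes are tested across a succession of arrays, the significance threshold would ordinarily need to be adjusted for multiple comparisons; that adjustment is outside this sketch.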

Although this invention has been described in terms of certain preferred embodiments, other embodiments which
will be apparent to those of ordinary skill in the art in view of the disclosure herein are also within the scope of this
invention. Accordingly, the scope of the invention is intended to be defined only by reference to the appended claims.

Table 1

Biallelic marker   BAC   Insert size   Average intermarker   Subchromosomal
(Genset code)            (kb)          distance (kb)         localization

99-2378            1     150           75                    21q11.2
99-2381            1     150           75                    21q11.2
99-2103            2     110           110                   21q21
99-2228            3     105           52.5                  21q21
99-2229            3     105           52.5                  21q21
99-2312            4     130           26                    21q21.2
99-2315            4     130           26                    21q21.2
99-2320            4     130           26                    21q21.2
99-2321            4     130           26                    21q21.2
99-2324            4     130           26                    21q21.2
99-2362            5     100           33.3                  21q21.3-q22.13
99-2364            5     100           33.3                  21q21.3-q22.13
99-2367            5     100           33.3                  21q21.3-q22.13
99-2371            6     135           45                    21q22.11-q22.13
99-2413            6     135           45                    21q22.11-q22.13
99-2419            6     135           45                    21q22.11-q22.13
99-2610            7     185           61.7                  21q22.11-q22.13
99-2615            7     185           61.7                  21q22.11-q22.13
99-2620            7     185           61.7                  21q22.11-q22.13
99-2645            8     250           83.3                  21q22.11-q22.13
99-2647            8     250           83.3                  21q22.11-q22.13
99-2649            8     250           83.3                  21q22.11-q22.13
99-2333            9     140           46.7                  21q22.11-q22.13
99-2341            9     140           46.7                  21q22.11-q22.13
99-2342            9     140           46.7                  21q22.11-q22.13
99-2240            10    95            13.6                  21q22.11-q22.13
99-2242            10    95            13.6                  21q22.11-q22.13
99-2244            10    95            13.6                  21q22.11-q22.13
99-2246            10    95            13.6                  21q22.11-q22.13
99-2248            10    95            13.6                  21q22.11-q22.13
99-2250            10    95            13.6                  21q22.11-q22.13
99-2251            10    95            13.6                  21q22.11-q22.13
99-2269            11    40            6.7                   21q22.11-q22.13
99-2271            11    40            6.7                   21q22.11-q22.13
99-2272            11    40            6.7                   21q22.11-q22.13
99-2273            11    40            6.7                   21q22.11-q22.13
99-2275            11    40            6.7                   21q22.11-q22.13
99-2278            11    40            6.7                   21q22.11-q22.13
99-2624            12    165           23.6                  21q22.2
99-2625            12    165           23.6                  21q22.2
99-2630            12    165           23.6                  21q22.2
99-2633            12    165           23.6                  21q22.2
99-2634            12    165           23.6                  21q22.2
99-2637            12    165           23.6                  21q22.2
99-2642            12    165           23.6                  21q22.2
99-2559            13    205           41                    21q22.3
99-2566            13    205           41                    21q22.3
99-2567            13    205           41                    21q22.3
99-2570            13    205           41                    21q22.3
99-2571            13    205           41                    21q22.3
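
The average intermarker distances in Table 1 appear consistent with dividing each BAC insert size by the number of biallelic markers listed for that BAC. The sketch below checks that reading against the tabulated values. This relationship is an inference from the numbers in the table, not a formula stated in the text, and the function name is hypothetical.

```python
def average_intermarker_distance(insert_size_kb, n_markers):
    """Insert size divided by marker count, rounded to one decimal place."""
    if n_markers <= 0:
        raise ValueError("need at least one marker")
    return round(insert_size_kb / n_markers, 1)


# (BAC, insert size in kb, number of markers listed, tabulated distance in kb),
# taken from Table 1.
ROWS = [
    (1, 150, 2, 75.0),
    (2, 110, 1, 110.0),
    (3, 105, 2, 52.5),
    (4, 130, 5, 26.0),
    (5, 100, 3, 33.3),
    (6, 135, 3, 45.0),
    (7, 185, 3, 61.7),
    (8, 250, 3, 83.3),
    (9, 140, 3, 46.7),
    (10, 95, 7, 13.6),
    (11, 40, 6, 6.7),
    (12, 165, 7, 23.6),
    (13, 205, 5, 41.0),
]

# Collect any BAC whose tabulated value differs from the computed one.
mismatches = [bac for bac, size_kb, n, tabulated in ROWS
              if average_intermarker_distance(size_kb, n) != tabulated]
```

Under this reading `mismatches` is empty, which suggests that the tabulated figure is a per-BAC average rather than a measured distance between specific adjacent markers.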



